杰哥的{运维，编程，调板子}小笔记

ARM Neoverse V3 (代号 Poseidon) 微架构评测

Sat, 13 Jun 2026 00:00:00 +0000

ARM Neoverse V3 (代号 Poseidon) 微架构评测¶

背景¶

使用 ARM Neoverse V3 核心的 AWS Graviton 5 最近上线了，相比之前的 Neoverse V2 应该有一些改进，所以测试一下这个微架构在各个方面的表现。

官方信息¶

ARM 关于 Neoverse V3 微架构有如下公开信息：

Neoverse V3 与 Cortex X4 高度相似，这里也列出 Cortex X4 的相关信息：

下面分模块记录官方信息和实测结果。官方信息与实测结果一致的数据会加粗。

现有评测¶

网上已经有 Neoverse V3 微架构的评测和分析，建议阅读：

下面分各个模块分别记录官方提供的信息，以及实测的结果。读者可以对照已有的第三方评测理解。官方信息与实测结果一致的数据会加粗。

Benchmark¶

Neoverse V3 (AWS Graviton 5) 的性能测试结果见 SPEC。

前端¶

L1 ICache¶

官方信息：64KB, 4-way set associative, VIPT behaving as PIPT, 64B cacheline, PLRU replacement policy

测试 L1 ICache 容量，构造一个具有巨大指令 footprint 的循环，由大量 nop 和最后的分支指令组成，观察不同 footprint 下的 IPC：

起始 IPC 为 9。Neoverse V3 删除了 MOP Cache，不像 Neoverse V2 那样可以把两条 NOP 合并为一条 MOP 来提高 IPC。虽然是 10-wide Decode，IPC 只能到 9，应该是遇到了其他瓶颈。

超出 64KB L1 ICache 后，IPC 降到 4，说明 L2 Cache 可以提供每周期 16 字节的取指带宽。

L1 ICache 和 Neoverse V2 相同，只是去掉了 MOP Cache，增加了 Decode 宽度。

L1 ITLB¶

官方信息：Caches entries at the 4KB, 16KB, 64KB, or 2MB granularity, Fully associative, 48 entries

构造一组 B 指令，分布在不同的 page 上，让 ITLB 成为瓶颈：

48 Page 处出现拐点，对应 48 项的 L1 ITLB 容量。之后性能降到 7 CPI，对应 L2 Unified TLB 的延迟。

进一步增加 Page 数量，大约 1000 个页的时候，耗时从 7 cycle 逐渐上升：

L2 Unified TLB 一共 2048 个 Entry，猜测 ITLB 能使用的 L2 TLB 容量只有一半，也就是 1024 项。超出后需要 Page Table Walker 做地址翻译。测试时要注意避免 Huge Page 的影响。

L1 ITLB 和 Neoverse V2 行为相同。

Decode¶

官方信息：10-wide Decode

Neoverse V3 只有一个 Decode 路径，从 ICache 过来，不再有 Neoverse V2 的 MOP Cache。

Return Stack¶

Return Stack 记录最近的函数调用链，call 时压栈，return 时弹栈，用于预测 return 指令的目的地址。构造不同深度的调用链，发现 Neoverse V3 的 Return Stack 深度为 32：

大小和 Neoverse V2 相同。

BTB¶

构造大量 B 指令，BTB 需要记录它们的目的地址。分支数量超过 BTB 容量时，性能就会下降。将 B 指令紧密放置（每 4 字节一条）：

1024 条分支之前 CPI 约 0.5，说明 Neoverse V3 继承了 Neoverse V2 的 two taken 能力。之后到 8192 条分支之前 CPI 约 1，到 16384 条分支时 CPI 为 2，到 32768 条分支时 CPI 为 6。

性能曲线和 Neoverse V2 相同。Neoverse V2 的 BTB 官方描述是：

10x larger nanoBTB（注：Neoverse V1 的 nanoBTB 是 96 entry）
Split main BTB into two levels with 50% more entries（注：Neoverse V1 的 main BTB 是 8K entry）

据此推算 Neoverse V2 和 V3 有相同的三级 BTB 结构：

Nano BTB: 1024 branches, two taken, 1 cycle latency
L1 Main BTB: 8192 branches, two taken, 2 cycle latency
L2 Main BTB: 4096 branches (?)

主要疑点是 16384 条分支时如何实现 CPI 2，目前还缺少解释。

Conditional Branch Prediction¶

利用我们的逆向方法，观察分支地址对 PHR 的贡献：

B[2-3]: shift 263 次
B[4-5]: shift 262 次
B[6-7,12-13]: shift 261 次
B[8-9,14-15]: shift 260 次
B[10-11,16-17]: shift 259 次

分支目的地址的贡献：

T[7-8]: shift 263 次
T[5,9-10]: shift 262 次
T[2,11]: shift 261 次
T[3-4]: shift 260 次
T[6]: shift 259 次

找到对应位的异或关系后，推断出 PHR 共有 264*2=528 位，每个 taken branch 左移 2 位，footprint 从低位到高位如下：

B[2] xor T[7]
B[3] xor T[8]
B[4] xor T[9]
B[5] xor T[10]
B[6] xor B[12] xor T[11]
B[7] xor B[13] xor T[2]
B[8] xor B[14] xor T[3]
B[9] xor B[15] xor T[4]
B[10] xor B[16]
B[11] xor B[17] xor T[6]

其中 T[5] 没有找到异或关系。和 Neoverse V2 的 PHR 构造只有很小的区别：Neoverse V2 中，T[5] shift 次数是 259。

后端¶

Dispatch¶

官方信息：up to 10 MOPs per cycle and up to 20 uOPs per cycle, with the following limitations on the number of µOPs of each type that may be simultaneously dispatched:

Up to 4 µOPs utilizing the S or B pipelines
Up to 4 µOPs utilizing the M pipelines
Up to 2 µOPs utilizing the M0 pipelines
Up to 2 µOPs utilizing the V0 pipeline
Up to 2 µOPs utilizing the V1 pipeline
Up to 6 µOPs utilizing the L pipelines

Dispatch 宽度和 Decode 对齐，不过限制不少，实际很难跑满。

物理寄存器堆¶

测试物理寄存器堆大小，用两个依赖链很长的操作放在开头和结尾，中间填入若干无关指令来耗费物理寄存器堆：

32b int：speculative 32 位整数寄存器，拐点约 355
64b int：speculative 64 位整数寄存器，拐点约 192，只有 32b 的一半。猜测实际物理寄存器堆有 400 左右个 64 位寄存器，但可以分成两半各自当 32 位寄存器用
flags：speculative NZCV 寄存器，拐点约 82
32b fp：speculative 32 位浮点寄存器，观察到两次拐点，第一次和 32b int 接近，第二次和 64b int 接近

Store to Load Forwarding¶

官方信息：

The Neoverse V3 core allows data to be forwarded from store instructions to a load instruction with the restrictions mentioned below:

Load start address should align with the start or middle address of the older store
Loads of size greater than or equal to 8 bytes can get the data forwarded from a maximum of 2 stores. If there are 2 stores, then each store should forward to either first or second half of the load
Loads of size less than or equal to 4 bytes can get their data forwarded from only 1 store

描述和 Neoverse V2 相同。实测以下情况可以成功转发：

对地址 x 的 Store 转发到对地址 y 的 Load 成功时 y-x 的取值范围：

Store\Load	8b Load	16b Load	32b Load	64b Load
8b Store	{0}	{}	{}	{}
16b Store	{0,1}	{0}	{}	{}
32b Store	{0,2}	{0,2}	{0}	{-4,0}
64b Store	{0,4}	{0,4}	{0,4}	{-4,0,4}

一个 Load 需要转发两个 Store 的数据的情况：对地址 x 的 32b Store 和对地址 x+4 的 32b Store 转发到对地址 y 的 64b Load，在 Overlap 的情况下，要求 y=x，前半来自第一个 Store，后半来自第二个 Store。

和官方描述比较吻合，支持全部转发、转发前半、转发后半三种场景。针对常见的 64b Load，支持 y-x=-4。前半和后半也可以来自两个不同的 Store。对地址的对齐没有要求，跨缓存行边界也可以转发，只对 Load 和 Store 的相对位置有要求。转发成功时 5.3 Cycle，有 Overlap 但无法转发时 10.5 Cycle。

小结：ARM Neoverse V3 的 Store to Load Forwarding：

1 ld + 1 st: 要求 ld 和 st 地址相同或差出半个 st 宽度
1 ld + 2 st: 要求 ld 和 st 地址相同
1 ld + 4 st: 不支持

和 Neoverse V2 相同。

计算单元¶

官方信息：8x ALU, 3x Branch, 4x 128b SIMD

实测以下指令的吞吐：

int add: 6 IPC，只用到了 6 个 Single Cycle 单元，理论上两个 Multi Cycle 单元也能用上，但实际 IPC 达不到 8
int mul: 2 IPC，对应两个 Multi Cycle 单元
int not taken branch: 3 IPC，对应三个 Branch 单元
asimd fadd double: 4 IPC，对应四个 FP/ASIMD 单元

Load Store Unit¶

官方信息：1 Load/Store Pipe + 2 Load Pipe + 1 Store Pipe

一个周期内最多可以完成如下 Load/Store：

3x 64b Load
2x 64b Load + 2x 64b Store
1x 64b Load + 2x 64b Store
2x 64b Store

符合 1 LS + 2 LD + 1 ST pipe 的设计。相比 Neoverse V2 的 2 LS + 1 LD，同时 Load 和 Store 时性能更高。

每周期通过 load/store pair 指令可以完成的 128b 访存：

2x 128b Load
2x 128b Load + 2x 128b Store
1x 128b Load + 2x 128b Store
2x 128b Store

Load 没有跨越缓存行时，load to use 延迟 4 cycle；跨过 64B 缓存行边界时，增加到 5 cycle。与 Neoverse V2 相同。

Memory Dependency Predictor¶

为了预测执行 Load，需要确保它和之前的 Store 访问的内存没有 Overlap，所以需要一个预测器来预测这种依赖。参考 Store-to-Load Forwarding and Memory Disambiguation in x86 Processors 的方法，构造两种指令模式，分别测试数据和地址上的依赖：

数据依赖，地址无依赖：str x3, [x1] 和 ldr x3, [x2]
地址依赖，数据无依赖：str x2, [x1] 和 ldr x1, [x2]

初始化时 x1 和 x2 指向同一个地址，重复上述模式，观察性能下降时 ldr 指令的数量：

地址依赖的阈值是 56，数据依赖没有阈值。相比 Neoverse V2 有所增加。

Reorder Buffer¶

把两个串行的 fsqrt 序列放在循环的头和尾，中间用 NOP 填充。如果 ROB 足够大，执行开头串行 fsqrt 序列时可以同时执行结尾的，性能最优。ROB 不够大时则会出现性能下降。

测试发现大约 768 条 NOP 时出现性能下降。Neoverse V3 实现了 Instruction Fusion，两条 NOP 算做一条 uOP 和一条 MOP，所以 768 条 NOP 对应 384 MOP 的 ROB 大小。极限下 384 MOP 可以存 768 uOP，但实际很难达到，容易受限于其他结构。相比 Neoverse V2 的 320 MOP 有所增加。

L1 DCache¶

官方信息：64KB, 4-way set associative, VIPT behaving as PIPT, 64B cacheline, ECC protected, RRIP replacement policy, 4×64-bit read paths and 4×64-bit write paths for the integer execute pipeline, 3×128-bit read paths and 2×128-bit write paths for the vector execute pipeline

无论官方信息还是下面的实测结果，都和 Neoverse V2 相同。

容量¶

构造不同大小 footprint 的 pointer chasing 链，测试每条 load 指令的耗时：

64KB 处出现拐点，对应 L1 DCache 容量。之后延迟先上升后下降，与 ARM 采用的 Correlated Miss Caching (CMC) 预取器记住了 pointer chasing 的历史有关，详见 Arm Neoverse N2: Arm's 2nd generation high performance infrastructure CPUs and system IPs。

延迟¶

L1 DCache 的 load to use latency 是 4 cycle，没有针对 pointer chasing 做 3 cycle 优化。

吞吐¶

用 FP/ASIMD 128b Load 可以达到 3 IPC，对应 3x128b read paths；用 2x64b 整数 LDP 只能到 2 IPC，对应 4x64b read paths。要达到峰值读取性能，必须用 FP/ASIMD 指令。向量 128b Store 可以达到 2 IPC，对应 2x128b write paths；2x64b 整数 STP 也能到 2 IPC，对应 4x64b write paths。

VIPT¶

4KB page 下，64KB 4-way 的 L1 DCache 不满足 VIPT 的 Index 全在页内偏移的条件（详见 VIPT 与缓存大小和页表大小的关系）。此时要么用 PIPT，要么在 VIPT 基础上处理 alias 问题。参考浅谈现代处理器实现超大 L1 Cache 的方式的测试方法，用 shm 构造两个 4KB 虚拟页映射到同一个物理页，然后在两个虚拟页之间 copy，发现相比同一个虚拟页内 copy 有显著的性能下降，并产生了大量 L1 DCache Refill：

copy from aliased page = 8778731053 cycles, 55305 refills baseline = 5298206743 cycles, 31413 refills slowdown = 1.66x

这验证了 L1 DCache 采用的是 VIPT，并在正确性上做了 alias 处理。如果是 PIPT，L1 DCache 会发现两个页对应相同物理地址，性能不会下降，也不需要频繁 refill。

构造¶

为了支持每周期 3 条 Load，L1 DCache 通常会分 Bank，每个 Bank 有自己的读口。Load 分布到不同 Bank 上时可以同时读取；命中相同 Bank 但访问不同地址，就只能等下个周期。为了测试 Bank 构造，设计了一系列以不同固定 stride 间隔的 Load 指令：

Stride=1B/2B/4B/8B/16B/32B: IPC=3
Stride=64B: IPC=2
Stride=128B/256B/512B: IPC=1

Stride=64B 时出现 Bank Conflict，Stride=128B 时所有 Load 命中同一个 Bank，只能串行读取。根据这个现象，认为 Neoverse V3 的 L1 DCache 组织方式是：

一共有两个 Bank，Bank Index 是 VA[6]
每个 Bank 每周期可以从一个缓存行读取数据
支持多个 Load 访问同一个缓存行
多个 Load 访问同一个 Bank 的不同缓存行，只能一个周期完成一个 Load

这里讨论的是缓存行级别的 Bank。缓存行内部也会做 Bank 划分，但主要是为了功耗，比如从 64B 缓存行读 8B 数据，不需要把整个 64B 都读出来。

L1 DTLB¶

官方信息：Caches entries at the 4KB, 16KB, 64KB, 2MB or 512MB granularity, Fully associative, 96 entries.

用 pointer chasing 测试 L1 DTLB 容量，指针分布在不同的 page 上，让 DTLB 成为瓶颈：

96 Page 处出现拐点，对应 96 项的 L1 DTLB 容量。超出后需要额外 6 cycle 访问 L2 Unified TLB。容量相比 Neoverse V2 翻番。测试时注意避免 Huge Page 的影响。

L2 Unified TLB¶

官方信息：Shared by instructions and data, 8-way set associative, 2048 entries

L2 Cache¶

官方信息：2MB or 3MB, 8-way(2MB) or 12-way(3MB) set associative, 4 banks, PIPT, ECC protected, 64B cacheline

SVE¶

官方信息：128b SVE vector length

Linux 下查看 /proc/sys/abi/sve_default_vector_length，SVE 宽度为 16 字节，即 128b。

Neoverse V3 每周期最多执行 4 条 ASIMD 或 SVE 浮点 FMA 指令，浮点峰值性能：

单精度：128/32*2*4=32 FLOP per cycle
双精度：128/64*2*4=16 FLOP per cycle

与 Neoverse V2、Zen 2-4、Oryon、Firestorm、LA464、Haswell 等微架构看齐，但不及 Zen 5、Skylake 等通过 AVX512 提供的峰值浮点性能。

总结¶

Neoverse V3 相比 Neoverse V2 改动不算很大，主要变化：

Decode 宽度从 8-wide 增加到 10-wide，但去掉了 MOP Cache
ROB 从 320 MOP 增加到 384 MOP
LSU 从 2 LS + 1 LD 改为 1 LS + 2 LD + 1 ST
L1 DTLB 从 48 项翻倍到 96 项
Memory Dependency Predictor 从 40 增加到 56

整体上是一次稳健的迭代升级。

SPEC CPU 2026 Workload Analysis (FP Rate)

Fri, 29 May 2026 00:00:00 +0000

SPEC CPU 2026 Workload Analysis (FP Rate)¶

中文版本

Background¶

Following the INT Rate article, this article continues with the workload analysis of SPEC FP 2026 Rate.

The test environment is the same as the previous INT Rate article and won't be repeated here.

SPEC FP 2026 Rate Analysis¶

709.cactus_r¶

Cactus is a computational framework, used here to solve the Einstein equations in vacuum. Command:

cactus ShiftedGaugeWave.par

Measured runtime is 103.4s, reftime is 858s, corresponding to 8.30 points. Performance under different compilers and flags:

Compiler + Flags	Time (s)	Score	Improvement over GCC 14 `-O3` (%)
GCC 14 `-O3`	103.4	8.30	0
GCC 14 `-O3 -march=native`	83.9	10.23	23
GCC 14 `-O3 -ffast-math`	101.2	8.48	2
GCC 14 `-O3 -ljemalloc`	100.7	8.52	3
LLVM 22 `-O3`	94.6	9.07	9
LLVM 22 `-O3 -march=native`	90.5	9.48	14

-march=native provides a significant performance boost. LLVM 22 is faster than GCC 14 under -O3, but GCC 14's -O3 -march=native overtakes LLVM 22's -O3 -march=native. Details below.

Performance bottlenecks observed via perf:

ML_CCZ4::ML_CCZ4_EvolutionInteriorSplitBy2_Body from src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_EvolutionInteriorSplitBy2.cc: 41.30% of total time (same format below);
ML_CCZ4::ML_CCZ4_EvolutionInteriorSplitBy3_Body from src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_EvolutionInteriorSplitBy3.cc: 31.26%;
ML_CCZ4::ML_CCZ4_ConstraintsInterior_Body from src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_ConstraintsInterior_Body.cc: 6.71%;
ML_CCZ4::ML_CCZ4_EvolutionInteriorSplitBy1_Body from src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_EvolutionInteriorSplitBy3.cc: 6.44%.

These hotspot functions share a similar pattern: within three nested loops, they read data from corresponding 3D grid points, perform a series of Stencil memory accesses and floating-point operations (including heavy use of floating-point multiply, add, subtract, pow, and fabs), then write results back to arrays. The generated instructions use SSE for scalar double-precision floating-point without vectorization. During testing, compiler optimizations on pow and fabs were also observed. Under -O3, pow(a, 1) compiles to a, pow(a, 2) to a * a, and pow(a, -1) to 1.0 / a, but others like pow(a, 3) and pow(a, -2) fall back to libm's pow implementation. With -O3 -ffast-math, pow(a, 3) becomes a * a * a and pow(a, -2) becomes 1.0 / (a * a). See the comparison at Godbolt. In the code, the main occurrences are pow(a, -1), pow(a, 2), pow(a, -2), and pow(a, runtimeVariable), where runtimeVariable is a value only known at runtime, corresponding to shiftAlphaPower or harmonicN in the code. fabs is compiled into the bitwise andpd instruction, directly zeroing the sign bit.

With -O3 -march=native, vectorization still doesn't happen. It uses AVX2 instructions for scalar double-precision floating-point, with remaining calls to libm's pow for the cases mentioned above (pow(a, -2) or pow(a, runtimeVariable)). However, the rest of the computation benefits from vfmadd132sd/vfnmadd132sd, and vaddsd becomes a three-operand instruction (compared to the two-operand addsd) that also allows memory operands, further reducing instruction count. On ARM64, -march=native provides no improvement because the floating-point fused multiply-add instruction is available even without -march=native, see Godbolt. In a sense, the huge improvement from -march=native on AMD64 reflects a first-mover disadvantage: the baseline corresponds to very old processors lacking many important ISA extensions. This compatibility burden doesn't exist on many other ISAs; for instance, fused multiply-add (FMA) is already part of the baseline in many ISAs, where -march=native brings relatively smaller improvements. As a workaround, many software projects manually provide multiple code paths for different ISA extensions and select the best one at runtime based on availability. If compilers could do this automatically, it would bring nice overall performance improvements while maintaining compatibility and developer convenience.

Performance counter comparison across compilation options:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)
GCC 14 `-O3`	103.4	1423.6	747.8	110.1	9.8	677.0	5.2
GCC 14 `-O3 -march=native`	83.9	988.5	711.9	89.5	8.9	686.1	2.6
GCC 14 `-O3 -ffast-math`	101.8	1387.7	742.2	103.4	5.3	641.0	5.6
GCC 14 `-O3 -ljemalloc`	100.7	1423.6	747.8	110.1	9.8	677.0	5.2
LLVM 22 `-O3`	94.6	1323.1	659.1	96.6	6.1	659.0	15.2
LLVM 22 `-O3 -march=native`	90.5	1054.5	690.7	119.4	5.4	681.4	5.4

Total instruction count comes from instructions, Load from mem_inst_retired.all_loads, Store from mem_inst_retired.all_stores, Branch from branch-instructions, FP Scalar from fp_arith_inst_retired.scalar, and FP Vector from fp_arith_inst_retired.vector performance counters (same format below). Note that fused multiply-add instructions like vfmadd132sd are counted twice in fp_arith_inst_retired.scalar/vector.

From the table, under -O3 roughly half the instructions are Loads and the other half are floating-point scalar operations. This low compute-to-memory ratio is typical of Stencil computation: load a value from the grid neighborhood, do one multiply-add. With -O3 -march=native, FMA instructions reduce the instruction count substantially, but since FMA counts double and AVX2 instructions that perform both memory access and computation are counted in both Load and FP categories (the microarchitecture likely counts split micro-ops), the total instruction count no longer equals the sum of individual categories. The -O3 -ljemalloc option provides a slight performance advantage not reflected in instruction counts; its improvement mainly comes from better cache locality. GCC 14 and LLVM 22 have comparable performance under different flags. The generated instructions are similar in approach, with main differences in address computation, stack usage, and register allocation.

Notably, 709.cactus_r has high cache miss rates: under GCC 14 -O3, L1 ICache MPKI reaches 118.6B/1423.6B*1000=83.30, and L1 DCache MPKI is 125.6B/1423.6B*1000=88.23, the highest among both SPEC FP 2026 Rate and SPEC INT 2026 Rate. Cores with larger L1 ICache have an advantage here; L1 ICache bottlenecks at 32KB might disappear at 64KB. With -O3 -ljemalloc, L1 DCache MPKI drops to 111.7B/1423.6B*1000=78.46, yielding about 3% improvement with identical instruction counts compared to -O3.

722.palm_r¶

palm is a weather forecasting program that solves Navier-Stokes equations. Command:

palm_r < runfile_atmos

Measured runtime is 174.0s, reftime is 1320s, corresponding to 7.59 points. Performance under different compilers and flags:

Compiler + Flags	Time (s)	Score	Improvement over GCC 14 `-O3` (%)
GCC 14 `-O3`	174.0	7.59	0
GCC 14 `-O3 -march=native`	157.8	8.34	10
GCC 14 `-O3 -ffast-math`	168.4	7.84	3
GCC 14 `-O3 -ljemalloc`	172.4	7.66	1
LLVM 22 `-O3`	144.0	9.17	21
LLVM 22 `-O3 -march=native`	118.6	11.13	47

The trend is similar to 709.cactus_r: -O3 -march=native provides a massive performance boost, and LLVM 22 is significantly faster than GCC 14.

Hotspot functions:

advec_s_ws_ij from src/advec_ws.F90: 9.80%, classic 3D Stencil computation with balanced memory access and computation ratio, essentially load one point value then do multiply-add. Uses SSE for computation with partial vectorization (addpd/subpd/mulpd processing 2 double-precision elements per instruction), though some loops fail to vectorize and fall back to scalar instructions (addsd/subsd/mulsd);
advec_u_ws_ij from src/advec_ws.F90: 8.80%, same as above;
advec_v_ws_ij from src/advec_ws.F90: 8.54%, same as above;
advec_w_ws_ij from src/advec_ws.F90: 8.24%, same as above;
diffusion_e_ij from src/turbulence_closure_mod.F90: 5.14%, involves more complex floating-point operations like min/sqrt/div, plus bitwise operations using MERGE for ternary operations, no vectorization, scalar SSE floating-point.

Here is the Stencil computation code from advec_s_ws_ij, looping over i, j, k:

flux_r(k) = u_comp * ( &  37.0_wp * ( sk(k,j,i+1) + sk(k,j,i) ) &  - 8.0_wp * ( sk(k,j,i+2) + sk(k,j,i-1) ) &  + ( sk(k,j,i+3) + sk(k,j,i-2) ) ) * adv_sca_5

Performance counter comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)
GCC 14 `-O3`	174.0	3416.6	1267.4	271.1	155.6	779.0	318.5
GCC 14 `-O3 -march=native`	157.8	2710.0	1212.8	242.5	147.1	785.9	172.6
GCC 14 `-O3 -ffast-math`	168.4	3373.5	1204.7	278.0	134.0	612.8	363.1
GCC 14 `-O3 -ljemalloc`	172.4	3368.4	1259.7	260.7	141.6	779.0	318.5
LLVM 22 `-O3`	144.0	2640.4	835.5	216.3	90.4	179.5	609.7
LLVM 22 `-O3 -march=native`	118.6	1643.8	586.5	165.6	67.6	180.8	306.7

With -O3 -march=native, heavy AVX2 vectorized instructions appear: vmulpd/vdivsd/vaddpd/vsubpd/vfmadd213sd/vfmsub132pd/vfmsub231pd/vmovupd, each processing 4 double-precision elements. Vectorization degree is high; on AVX512-capable processors, performance could be even higher. Compared to 709.cactus_r where pow and similar issues prevent vectorization, 722.palm_r's vectorization benefits are much more apparent. LLVM 22 under -O3 outperforms GCC 14 because it successfully vectorizes hotspot functions like advec_u/v/w_ws_ij, while GCC 14 still uses scalar instructions. This is reflected in significantly more FP vector instructions and fewer FP scalar instructions. Under LLVM 22, with those hotspot functions well-optimized, flow_statistics (from src/flow_statistics.F90, 5.79% time share) becomes the new bottleneck. It has limited vectorizable portions, hence its time share increases. Even with -O3 -march=native, it still uses AVX2+FMA instructions for scalar computation with little time difference. As other parts speed up, its time share further increases to 6.95%, similar to Amdahl's law.

709.cactus_r and 722.palm_r share the same Stencil computation pattern. Physics simulations frequently do this: solving differential equations in 3D space requires repeated computation over each point's neighborhood, which ultimately becomes Stencil.

731.astcenc_r¶

astcenc is an encoder for the ASTC lossy compressed image format. It runs three times:

# 1. linear astcenc_r ref-inputs-linear.txt # 2. hdr astcenc_r ref-inputs-hdr.txt # 3. precision astcenc_r ref-inputs-precision.txt

Measured runtimes are 49.9s, 72.1s, and 53.8s, totaling 175.8s, reftime 840s, corresponding to 4.78 points. Performance under different compilers and flags:

Compiler + Flags	Total Time (s)	1. linear (s)	2. hdr (s)	3. precision (s)	Score	Improvement over GCC 14 `-O3` (%)
GCC 14 `-O3`	175.8	49.9	72.1	53.8	4.78	0
GCC 14 `-O3 -march=native`	157.3	44.0	63.2	50.0	5.34	12
GCC 14 `-O3 -ffast-math`	160.5	44.6	67.2	48.7	5.23	10
LLVM 22 `-O3`	134.0	38.5	56.1	39.3	6.27	31
LLVM 22 `-O3 -march=native`	117.2	34.4	48.6	34.1	7.17	50

Another benchmark where LLVM 22 has a clear advantage over GCC 14. Other flags like -flto and -ljemalloc have almost no impact and are omitted. 731.astcenc_r has the highest MPKI in SPEC FP 2026 Rate at 5.0, much higher than most others which are below 1.0 (second highest is 737.gmsh_r at 3.33, third is 767.nest_r at only 0.83), and also higher than many SPEC INT 2026 Rate benchmarks. Below is per-workload analysis.

1. linear¶

Main hotspot functions:

compute_angular_endpoints_for_quant_levels from src/astcenc_weight_align.cpp: 18.93%, main bottleneck is in the inner loop doing single-precision floating-point scalar SSE computation, with calls to nearbyint from libm for rounding. The developers intentionally wrote SIMD-friendly code using vfloat4 for batch operations, with vmask4 storing comparison results (four ints, 0 for false, -1 for true), and a select function for vectorized ternary operations. Unfortunately the compiler doesn't cooperate, producing scalar SSE instead;
compute_avgs_and_dirs_3_comp_rgb from src/astcenc_averages_and_directions.cpp: 14.70%, similar pattern with vfloat4 and vmask4 computations in loops, but SSE instructions are all scalar;
compute_quantized_weights_for_decimation from src/astcenc_ideal_endpoints_and_weights.cpp: 13.34%, involves quantization with vint and table lookups (vtable_lookup_32bit). The vfloat/vint types are designed to automatically map to the platform's available SIMD width (defined in src/astcenc_vecmathlib.h, e.g., AVX maps to 8 elements with vfloat8, SSE to 4 elements with vfloat4), but these wider modes are disabled in SPEC, falling back to 4 elements;
compute_ideal_weights_for_decimation from src/astcenc_ideal_endpoints_and_weights.cpp: 9.57%, main bottleneck is a gather operation gatherf_byte_inds. Since SSE doesn't support gather, it splits into four elements with individual loads and scalar computation;
bilinear_infill_vla from src/astcenc_ideal_endpoints_and_weights.cpp: 7.80%, bottleneck is also the gather operation gatherf_byte_inds;
compute_error_squared_rgb from src/astcenc_averages_and_directions.cpp: 6.39%, bottleneck is gather plus subsequent vector computation, but GCC 14 compiles everything to scalar SSE.

The fact that native SIMD code compiles to scalar instructions also suggests that correct vectorization would yield significant additional performance. Furthermore, with -O3 -march=native, vectors widen to 256 bits, and the vblendvps instruction becomes available to implement the select function. As mentioned, LLVM 22 is significantly faster. Here's the comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)	MPKI
GCC 14 `-O3`	49.9	835.7	259.3	55.6	63.2	188.6	28.6	3136.0	3.75
GCC 14 `-O3 -march=native`	44.0	652.4	234.0	46.3	52.9	184.6	28.5	3148.2	4.83
GCC 14 `-O3 -ffast-math`	44.6	780.5	259.8	54.6	49.3	159.9	43.2	2139.0	2.74
LLVM 22 `-O3`	38.5	829.7	235.0	34.8	36.1	68.8	155.6	1095.5	1.32
LLVM 22 `-O3 -march=native`	34.4	620.9	179.5	17.7	19.6	42.1	125.7	823.4	1.33

The counters show GCC 14 performs worse overall because LLVM 22 does more vectorization: its FP vector instructions far exceed FP scalar, with significantly fewer mispredictions and much lower MPKI. Detailed analysis follows.

First, let's look at how GCC 14 compiles 731.astcenc_r's SIMD-native code. Taking the hotspot functions analyzed above as examples, a common pattern uses vfloat4 comparison plus select to implement vectorized max:

vfloat4 vmax(vfloat4 a, vfloat4 b) {  vmask4 mask = b > a;  return select(a, b, mask); }

Under -O3, GCC 14 compiles this to:

vmax(vfloat4 a, vfloat4 b):  # a vector in xmm0 (a[0] and a[1]) and xmm1 (a[2] and a[3]) registers  # b vector in xmm2 (b[0] and b[1]) and xmm3 (b[2] and b[3]) registers  # although each element is single-precision, each xmm register only holds two elements  movq %xmm1, %rax # rax = a3 | a2  movq %xmm3, %rcx # rcx = b3 | b2  movq %xmm0, %rsi # rsi = a1 | a0  movd %ecx, %xmm1 # xmm1 = b2  movd %eax, %xmm6 # xmm6 = a2  shrq $32, %rcx # rcx = b3  movdqa %xmm2, %xmm5 # xmm5 = b1 | b0  shrq $32, %rax # rax = a3  movdqa %xmm2, %xmm0 # xmm0 = b1 | b0  movd %ecx, %xmm4 # xmm4 = b3  shufps $85, %xmm5, %xmm5 # xmm5 = b1 | b1 | b1 | b1  movd %eax, %xmm2 # xmm2 = a3  movd %esi, %xmm7 # xmm7 = a0  shrq $32, %rsi # rsi = a1  movdqa %xmm5, %xmm3 # xmm3 = b1 | b1 | b1 | b1  comiss %xmm2, %xmm4 # compare a3 and b3  movd %esi, %xmm5 # xmm5 = a1  seta %al # al = (b3 > a3)  comiss %xmm6, %xmm1 # compare b2 and a2  jbe .L14 # if a2 >= b2, jump to .L14  testb %al, %al  jne .L15 # if b3 > a3, jump to .L15  # here a2 < b2, a3 >= b3  maxss %xmm7, %xmm0 # xmm0 = max(a0, b0)  maxss %xmm5, %xmm3 # xmm3 = max(a1, b1)  unpcklps %xmm2, %xmm1 # xmm1 = a3 | b2  unpcklps %xmm3, %xmm0 # xmm0 = max(a1, b1) | max(a2, b2)  ret .L14: # handles a2 >= b2  testb %al, %al  jne .L16 # if b3 > a3, jump to .L16  #3 here a2 >= b2, a3 >= b3  movaps %xmm6, %xmm1 # xmm1 = a2  # omitted below: case analysis for a2 vs b2, a3 vs b3 .L17:  maxss %xmm7, %xmm0  maxss %xmm5, %xmm3  unpcklps %xmm2, %xmm1  unpcklps %xmm3, %xmm0  ret .L16:  movaps %xmm4, %xmm2  movaps %xmm6, %xmm1  jmp .L17 .L15:  maxss %xmm7, %xmm0  maxss %xmm5, %xmm3  movaps %xmm4, %xmm2  unpcklps %xmm2, %xmm1  unpcklps %xmm3, %xmm0  ret

Strangely, it first extracts input values into general-purpose registers, then separately compares the last two elements a2 vs b2 and a3 vs b3, using branches to handle four possible cases (knowing where the last two max elements come from), yet still uses maxss for the first two elements. Why not just use maxss for all four elements from the start? With -O3 -ffast-math, it inexplicably learns this:

vmax(vfloat4, vfloat4):  movq %xmm0, %rsi  movq %xmm1, %rcx  movq %xmm2, %rdx  movd %esi, %xmm1  movq %xmm3, %rax  movdqa %xmm2, %xmm0  shrq $32, %rdx  maxss %xmm1, %xmm0  shrq $32, %rsi  movdqa %xmm3, %xmm1  shrq $32, %rax  movd %ecx, %xmm3  shrq $32, %rcx  movd %edx, %xmm2  movd %esi, %xmm4  maxss %xmm3, %xmm1  movd %ecx, %xmm5  movd %eax, %xmm3  maxss %xmm4, %xmm2  maxss %xmm5, %xmm3  unpcklps %xmm2, %xmm0  unpcklps %xmm3, %xmm1  ret

But it still uses scalar SSE, while LLVM 22 knows how to vectorize with maxps:

vmax(vfloat4, vfloat4):  movlhps %xmm3, %xmm2  movlhps %xmm1, %xmm0  maxps %xmm2, %xmm0  movaps %xmm0, %xmm1  unpckhpd %xmm0, %xmm1  retq

The remaining instructions are only for handling calling convention data placement; within the function, typically a single maxps instruction completes the max computation for all 4 elements. This example illustrates why LLVM 22 is so much faster than GCC 14: GCC 14 generates many useless branches for the select comparison and fails to vectorize the max operation. Even with -march=native, GCC 14 still uses AVX instructions for scalar max operations. See Godbolt. GCC 14's high MPKI comes from exactly this. I also tested the same code on LoongArch, where vectorization support is similarly poor (see Godbolt), so I filed an issue. Considering only the vectorized fmax kernel, an optimized implementation using vfcmp.slt.s + vbitsel.v would be roughly 2.9x the performance of LLVM 22's current output. A small trivia point: x86 SSE/AVX max instructions implement a > b ? a : b logic, while LoongArch's fmax implements IEEE754 maxNum. These differ when NaN is present: the former returns b whenever either a or b is NaN, while the latter returns the non-NaN value when only one operand is NaN.

2. hdr¶

Main hotspot functions:

compute_angular_endpoints_for_quant_levels from src/astcenc_weight_align.cpp: 19.80%, see above;
compute_avgs_and_dirs_3_comp_rgb from src/astcenc_averages_and_directions.cpp: 15.37%, see above;
compute_quantized_weights_for_decimation from src/astcenc_ideal_endpoints_and_weights.cpp: 12.40%, see above;
compute_error_squared_rgb from src/astcenc_averages_and_directions.cpp: 6.91%, see above;
compute_ideal_weights_for_decimation from src/astcenc_ideal_endpoints_and_weights.cpp: 5.68%, see above.

Hotspot functions are essentially the same as 1. linear. GCC 14 generates many branches and scalar SSE instructions, while LLVM 22 vectorizes better and avoids unnecessary branches. Comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)	MPKI
GCC 14 `-O3`	72.1	1091.8	306.9	78.6	91.7	245.8	30.4	4928.9	4.51
GCC 14 `-O3 -march=native`	63.1	851.4	271.2	65.2	77.4	240.1	30.4	4890.6	5.74
GCC 14 `-O3 -ffast-math`	67.1	1036.6	311.0	85.5	73.7	200.8	54.3	4077.0	3.93
LLVM 22 `-O3`	55.9	1107.9	276.5	55.9	56.9	111.8	129.9	1943.2	1.75
LLVM 22 `-O3 -march=native`	48.6	825.2	209.3	30.7	34.1	85.2	139.7	1411.6	1.71

3. precision¶

Hotspot functions are mostly the same as 1. linear and 2. hdr, with the addition of find_best_partition_candidates from src/astcenc_find_best_partitioning.cpp, where the main bottleneck is a / sqrt(length) computation. This time GCC 14 under -O3 actually vectorizes this step correctly via a scalar sqrtss, shufps to broadcast the result to all lanes, then divps for batch division. However, other hotspot functions still produce slow code as before. Performance counter comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)	MPKI
GCC 14 `-O3`	53.8	711.5	176.8	62.0	61.3	177.0	9.3	5119.2	7.19
GCC 14 `-O3 -march=native`	49.2	570.5	161.3	57.1	54.7	176.1	9.2	5113.1	8.96
GCC 14 `-O3 -ffast-math`	48.7	655.9	168.3	64.6	49.8	156.5	19.5	4227.6	6.56
LLVM 22 `-O3`	39.3	729.9	149.2	42.8	35.9	75.3	77.2	1906.7	2.61
LLVM 22 `-O3 -march=native`	34.1	544.9	112.5	28.0	23.2	52.0	87.1	1445.7	2.65

Summary¶

731.astcenc_r uses SIMD-native programming with vfloat4, vint4, vmask4, etc., written with SIMD instructions in mind. Unfortunately GCC 14 fails to recognize the code's intent and utilize hardware instructions, inexplicably generating branches for the select function. LLVM 22 does much better, vectorizing where appropriate. Meanwhile, slightly less mainstream ISAs like LoongArch still lack adequate optimization for these code patterns, in both GCC and LLVM.

736.ocio_r¶

ocio stands for OpenColorIO. Similar to 731.astcenc_r, it processes images, but focuses more on color transformation rather than compression. This benchmark includes four workloads:

# 1. lut1d ocioperf --spec-validation-offset 101 --spec-validation-stride 17 --spec-validation-pixels 131 --bitdepths ui16 ui16 --iter 100 --test -1 --transform ctf/lut1d_halfdom.ctf # 2. mntr ocioperf --spec-validation-offset 202 --spec-validation-stride 19 --spec-validation-pixels 132 --bitdepths ui16 f32 --iter 200 --8kres --test 0 --transform ctf/mntr_srgb_identity.ctf # 3. aces ocioperf --spec-validation-offset 303 --spec-validation-stride 23 --spec-validation-pixels 133 --bitdepths f32 f32 --iter 20 --8kres --test -1 --transform clf/aces_to_video_with_look.clf # 4. heavy ocioperf --spec-validation-offset 404 --spec-validation-stride 29 --spec-validation-pixels 134 --bitdepths f32 f32 --iter 25 --test -1 --transform clf/heavy_transform.clf

reftime is 875s. Performance under different compilers and flags:

Compiler + Flags	Total Time (s)	1. lut1d (s)	2. mntr (s)	3. aces (s)	4. heavy (s)	Score	Improvement over GCC 14 `-O3` (%)
GCC 14 `-O3`	139.8	6.1	11.2	67.8	54.6	6.26	0
GCC 14 `-O3 -march=native`	105.0	4.2	10.2	49.6	40.1	8.33	33
GCC 14 `-O3 -ffast-math`	139.4	6.4	11.4	67.8	53.9	6.28	0.3
LLVM 22 `-O3`	128.9	6.8	11.3	61.7	49.0	6.79	8
LLVM 22 `-O3 -march=native`	105.3	5.4	9.6	49.3	40.9	8.31	33

Again, -O3 -march=native brings significant improvement. LLVM 22 still has a performance edge over GCC 14 under -O3, but they're essentially equal under -O3 -march=native. Detailed analysis below.

1. lut1d¶

Hotspot functions:

OpenColorIO_v2_2dev::BitDepthCast<BIT_DEPTH_F32, BIT_DEPTH_UINT16>::apply from src/ASWF-OpenColorIO/src/OpenColorIO/CPUProcessor.cpp: 45.16%, in a loop over float elements in the [0, 1] range, multiplies by 65535 to scale to uint16_t range, adds 0.5, clamps to uint16_t range, then converts float to uint16_t. Compiled to SSE vector instructions;
OpenColorIO_v2_2dev::Lut1DRendererHalfCode<BIT_DEPTH_UINT16, BIT_DEPTH_F32>::apply from src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut1d/Lut1DOpCPU.cpp: 33.70%, loops over input uint16_t values doing table lookup (reading float values from a precomputed array indexed by uint16_t), bottleneck is SSE scalar indirect memory access;
__memmove_avx_unaligned_erms from libc: 13.28%, AVX-accelerated memmove;
__memset_avx2_unaligned_erms from libc: 3.55%, AVX-accelerated memset.

For this highly vectorizable code, -O3 -march=native improvement is substantial. In OpenColorIO_v2_2dev::BitDepthCast<BIT_DEPTH_F32, BIT_DEPTH_UINT16>::apply, it uses AVX2 256-bit vector computation and FMA instructions to fuse the scale and add-0.5 steps, followed by bitwise operations for clamping. This function's time share drops to 27.82% under -O3 -march=native, making the still-scalar-SSE OpenColorIO_v2_2dev::Lut1DRendererHalfCode<BIT_DEPTH_UINT16, BIT_DEPTH_F32>::apply the primary bottleneck at 42.85%.

In this sub-benchmark, GCC 14 is slightly faster than LLVM 22. Comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)
GCC 14 `-O3`	6.1	106.2	23.3	11.7	4.2	2.6	5.0	2.6
GCC 14 `-O3 -march=native`	4.2	63.8	22.0	11.0	3.6	2.6	2.5	2.5
GCC 14 `-O3 -ffast-math`	6.4	104.8	23.2	11.7	4.2	2.5	5.0	2.6
LLVM 22 `-O3`	6.8	106.1	23.3	11.7	3.6	2.5	5.0	2.6
LLVM 22 `-O3 -march=native`	5.4	72.5	24.8	11.0	1.4	2.5	2.5	2.5

At the assembly level, GCC 14 and LLVM 22 differ in implementation. Both start with multiplication and addition, but differ in the clamping portion for handling 16-to-32-bit width conversion: GCC 14 mainly uses punpcklwd-type instructions, while LLVM 22 prefers pshufd-type instructions (see Godbolt). Although total instruction counts are close, different instructions require different execution times on hardware, resulting in some IPC difference. Similar situation after enabling -O3 -march=native.

2. mntr¶

Hotspot functions:

OpenColorIO_v2_2dev::BitDepthCast<BIT_DEPTH_UINT16, BIT_DEPTH_F32>::apply from src/ASWF-OpenColorIO/src/OpenColorIO/CPUProcessor.cpp: 55.41%, this time converting from uint16_t to float, so the computation becomes converting uint16_t to float then multiplying by 1.0/65535.0 (no clamping needed). The compiler vectorizes correctly, though the 16-to-32-bit width conversion takes considerable effort;
OpenColorIO_v2_2dev::ScaleRenderer::apply from src/ASWF-OpenColorIO/src/OpenColorIO/ops/matrix/MatrixOpCPU.cpp: 41.52%, simple per-pixel scaling of four components (from out[0] = in[0] * m_scale[0] to out[3] = in[3] * m_scale[3]). All pixels share the same m_scale array, which should be easy to vectorize, but it isn't because the pointers lack restrict annotations. The compiler cannot determine whether out and m_scale might alias; only if they don't overlap can it directly vectorize with mulps (see Godbolt).

Since AMD64 lacks vector instructions for mixed-width computation, much overhead goes to shuffling data between vectors rather than actual computation and memory access. RISC-V Vector's design does produce more concise instruction sequences here (see Godbolt). Comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)
GCC 14 `-O3`	11.2	209.9	56.5	33.3	7.5	26.8	6.6	1.9
GCC 14 `-O3 -march=native`	10.2	159.6	54.8	29.9	7.1	26.8	3.3	1.8
GCC 14 `-O3 -ffast-math`	11.4	209.7	56.5	33.3	7.5	26.7	6.6	1.8
LLVM 22 `-O3`	11.3	194.5	56.5	33.3	8.6	26.5	6.7	1.9
LLVM 22 `-O3 -march=native`	9.6	149.4	58.2	29.9	2.8	26.5	3.4	2.0

3. aces¶

Hotspot functions:

OpenColorIO_v2_2dev::Lut3DTetrahedralRenderer::apply from src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut3d/Lut3DOpCPU.cpp: 50.74%, complex operations per element: multiply, clamp, floor and ceil converted to int, then index-based table lookup with indirect memory access, followed by weighted averaging. Low vectorization;
OpenColorIO_v2_2dev::MatrixRenderer::apply from src/ASWF-OpenColorIO/src/OpenColorIO/ops/matrix/MatrixOpCPU.cpp: 11.55%, matrix operations multiplying input 4D vectors by a 4x4 matrix. High vectorization;
__log2f_fma from libm: 10.02%, computing float log2;
OpenColorIO_v2_2dev::CameraLin2LogRenderer::apply from src/ASWF-OpenCOlorIO/src/OpenColorIO/ops/log/LogOpCPU.cpp: 9.76%, checks input range; if below threshold m_linb, uses linear multiply-add; otherwise calls log2 combined with multiply-add and max operations. Low vectorization.

Comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)
GCC 14 `-O3`	67.8	1258.9	299.3	86.3	100.5	260.6	28.0	146.6
GCC 14 `-O3 -march=native`	49.6	873.7	289.0	84.9	84.0	257.4	14.0	135.4
GCC 14 `-O3 -ffast-math`	67.8	1251.5	296.4	94.4	109.9	213.7	43.8	150.6
LLVM 22 `-O3`	61.7	1152.4	416.6	136.7	133.7	329.0	15.4	168.5
LLVM 22 `-O3 -march=native`	49.3	857.8	342.8	92.6	84.4	329.0	13.0	151.6

The performance gap between GCC 14 and LLVM 22 under -O3 mainly comes from floor/ceil handling: GCC 14 generates a complex series of SSE instructions (lacking SSE4.1's roundps), while LLVM 22 calls libm's __floorf_sse41, whose function body is essentially a single SSE4.1 roundps instruction plus return. Although there's function call overhead (call/ret plus register save/restore with extra Loads and Stores), it's still a net win. However, on processors truly without SSE4.1, GCC 14's approach would be faster. This trade-off cannot be resolved without -march=native; one can only guess which case is more probable. Today, AMD64 processors with SSE4.1 far outnumber those without.

After enabling -O3 -march=native, the vroundps instruction replaces the previous ceil/floor implementations (GCC 14's vectorized approach or LLVM 22's libm calls), giving both compilers significant improvement and bringing them to the same level. FMA also successfully fuses many multiply-add computations.

4. heavy¶

Hotspot functions:

__powf_fma from libm: 26.17%;
OpenColorIO_v2_2dev::Lut3DRenderer::apply from src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut3d/Lut3DOpCPU.cpp: 25.69%, similar pattern to Lut3DTetrahedralRenderer::apply above with clamp/floor/ceil and table lookup, just with different final computation, all scalar SSE;
OpenColorIO_v2_2dev::Lut1DRenderer<BIT_DEPTH_F32, BIT_DEPTH_F32>::apply from src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut1d/Lut1DOpCPU.cpp: 15.63%, similar to Lut3DRenderer::apply but simpler 1D table lookup, still all scalar;
OpenColorIO_v2_2dev::CDLRendererFwd<true>::apply: 10.88%, calls pow (causing __powf_fma's high share), plus floating-point multiply, add/sub, and clamp. All scalar;
OpenColorIO_v2_2dev::GammaMoncurveOpCPUFwd::apply: 5.41%, also calls pow, with additional floating-point operations and comparisons.

Comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)
GCC 14 `-O3`	54.6	1013.5	209.4	57.0	80.8	253.7	5.8	32.0
GCC 14 `-O3 -march=native`	40.9	764.7	204.0	54.8	70.8	260.2	3.3	31.8
GCC 14 `-O3 -ffast-math`	53.9	971.0	202.1	50.5	80.6	252.3	6.6	29.1
LLVM 22 `-O3`	49.0	861.5	250.4	77.3	102.7	215.6	29.9	28.8
LLVM 22 `-O3 -march=native`	40.9	726.8	206.9	55.4	67.3	255.6	25.7	28.5

The performance difference between LLVM 22 and GCC 14 is the same as in 3. aces: ceil/floor handling. Additionally, like 731.astcenc_r, for vectorized min/max operations, LLVM 22 correctly vectorizes to maxps/minps while GCC 14 produces verbose code.

Summary¶

736.ocio_r is another application well-suited for vectorization. Although it doesn't use vfloat4 directly like 731.astcenc_r, it's image processing where each loop iteration handles one pixel with four channels. In many cases these four channels undergo identical computation, making it very amenable to vectorization. LLVM 22 under -O3 generates better code than GCC 14, from floor/ceil mapping to libm functions to better vectorization. However, with -O3 -march=native, the performance gap between GCC 14 and LLVM 22 becomes negligible, indicating that with sufficient ISA extensions enabled, both converge to similar implementations. This also suggests GCC 14's SSE code generation has deficiencies: perhaps it's not that GCC 14 cannot vectorize (since it does so with -O3 -march=native), but rather it doesn't know how to express vectorized code with SSE after attempting vectorization, so it falls back to scalar.

737.gmsh_r¶

737.gmsh_r is a 3D CAD meshing software with seven workloads:

# 1. choi gmsh_r -option gmsh.opts -nt 0 choi.geo # 2. mediterranean gmsh_r -option gmsh.opts -nt 0 mediterranean.geo # 3. projection gmsh_r -option gmsh.opts -nt 0 projection.geo # 4. gasdis gmsh_r -option gmsh.opts -nt 0 gasdis.geo # 5. Torus gmsh_r -option gmsh.opts -nt 0 Torus.geo # 6. spec gmsh_r -option gmsh.opts -nt 0 spec.geo -clscale 0.175 -algo del2d -algo hxt # 7. p19 gmsh_r -option gmsh.opts -nt 0 p19.geo

Workload runtimes are 17.1s, 11.8s, 11.2s, 16.9s, 9.2s, 13.4s, and 12.8s, totaling 92.2s, reftime 459s, corresponding to 4.98 points. Both -O3 -ffast-math and -O3 -march=native yield minimal benefit; LLVM 22 is actually slower than GCC 14, so detailed comparison is omitted.

When compiling with -O3 -march=native, if CC is set to just gcc without passing -std=c18, the 4. gasdis workload enters an infinite loop, continuously reporting: Info : Symbolic perturbation failed (2 superposed vertices ?). The difference is whether FMA contraction occurs: with -O3 -std=c18 -march=native, contraction doesn't happen; with -O3 -march=native or -O3 -std=gnu18 -march=native, it does (see Godbolt). In other programs FMA contraction improves performance, but here it unfortunately causes an infinite loop. This relates to -fp-contract:

-ffp-contract=style   -ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is implemented for C and C++, where it enables contraction within one expression, but not across different statements.   The default is -ffp-contract=off for C in a standards compliant mode (-std=c11 or similar), -ffp-contract=fast otherwise.

This only affects C code, not C++, so in practice only 737.gmsh_r is affected. Although 709.cactus_r also has C code, its main computation is in C++.

Per-workload hotspot analysis follows.

1. choi¶

Hotspot functions:

netgen::ADTree6::GetIntersecting from src/gmsh/contrib/Netgen/libsrc/gprim/adtree.cpp: 18.40%, implements a 6-dimensional KD-Tree search algorithm. Main bottleneck is the data-dependent branch if (node->pi != -1) with high misprediction rate;
__ieee754_atan2_fma from libm: 6.64%;
reparamMeshVertexOnFace from src/gmsh/src/geo/MVertex.cpp: 6.03%, enters different if-else branches based on vertex dimension, with significant mispredictions.

Although floating-point is used, the computation pattern doesn't lend itself to vectorization. KD-Tree search naturally has high MPKI. Executed 204.7B instructions with 744.3M mispredictions, MPKI = 744.3M/204.7B*1000=3.64, second highest in SPEC FP 2026 Rate. The highest, 731.astcenc_r, is essentially due to GCC's poor implementation as discussed above; it could be optimized to around LLVM 22's 1.3, which would make 737.gmsh_r first.

2. mediterranean¶

Hotspot functions:

meshGEdgeProcessing from src/gmsh/src/mesh/meshGEdge.cpp: 36.55%, main bottleneck is Gauss-Seidel iteration in a loop, where scalar division and comparisons take considerable time;
KDTreeSingleIndexAdaptor::searchLevel from src/gmsh/src/numeric/nanoflann.hpp: 33.50%, another classic KD-Tree search, recursing into left or right subtrees based on input value;
InterpolateCurve from src/gmsh/src/geo/GeoInterpolation.cpp: 6.53%, recursive interpolation computation.

Although floating-point is involved, the computation pattern is not vectorization-friendly because intermediate results feed into if-branches, with additional floating-point computation inside the branches.

3. projection¶

Hotspot functions:

laplaceSmoothing from src/gmsh/src/mesh/meshGFaceOptimize.cpp: 11.73%, main bottleneck is std::set operations (which is backed by std::map), hence the std::map functions below;
std::map::_M_get_insert_unique_pos from libstdc++: 7.49%, std::map insertion algorithm;
__ieee754_atan2_fma from libm: 7.21%;
reparamMeshVertexOnFace: 6.66%, see above;
std::map::_M_get_insert_unique from libstdc++: 6.09%, std::map insertion;
SetRotationMatrix from src/gmsh/src/geo/Geo.cpp: 5.01%, multi-layer loops suitable for vectorization, and the compiler does vectorize, though time share is low.

The main bottleneck in this workload is std::map operations.

4. gasdis¶

Hotspot functions:

MakeHybridHexTetMeshConformalThroughTriHedron from src/gmsh/src/mesh/meshCombine3D.cpp: 30.18%, main bottleneck is std::map searches in a loop;
parallelDelaunay3D from src/gmsh/contrib/hxt/tetMesh/src/hxt_tetDelaunay.c: 9.05%, Delaunay triangulation algorithm;
hxtRefineTetrahedra from src/gmsh/contrib/hxt/tetMesh/src/hxt_tetRefine.c: 5.18%, loop with floating-point computation including add/sub, mul/div, and sqrt.

Bottleneck is mainly std::map.

5. Torus, 6. spec, and 7. p19¶

The last three workloads have the same hotspot functions as 4. gasdis.

Summary¶

Per-workload data:

Workload	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)	MPKI
1. choi	17.0	204.7	59.3	25.6	39.4	22.1	0.3	744.3	3.64
2. mediterranean	11.7	190.7	57.4	23.2	24.0	28.5	2.4	71.0	0.37
3. projection	11.1	109.0	29.1	14.4	20.3	13.3	2.2	183.0	1.68
4. gasdis	16.9	157.8	46.3	17.8	27.6	19.6	0.2	689.9	4.37
5. Torus	9.2	77.3	21.9	8.2	13.4	9.4	0.5	380.4	4.92
6. spec	13.3	101.4	30.2	10.8	18.1	10.9	0.2	546.1	5.39
7. p10	12.7	96.3	28.8	10.2	17.2	10.4	0.1	529.3	5.50

Overall MPKI is high, largely attributable to KD-Tree queries and std::map queries/insertions, although the tree keys are single-precision floats. Based on the analysis, the code indeed isn't suitable for vectorization, and FMA contraction is disabled since it would cause non-convergence.

748.flightdm_r¶

flightdm is a flight dynamics simulator with eight workloads:

# 1. weather JSBSim --nohighlight scripts/weather-balloon2.xml # 2. B747 JSBSim --nohighlight scripts/B747_script1.xml # 3. x153 JSBSim --nohighlight scripts/x153.xml # 4. c3104 JSBSim --nohighlight scripts/c3104.xml # 5. ah1s JSBSim --nohighlight scripts/ah1s_flight_test.xml # 6. orbit_torque JSBSim --nohighlight scripts/ball_orbit_g_torque.xml # 7. orbit_torque2 JSBSim --nohighlight scripts/ball_orbit_g_torque2.xml # 8. orbit JSBSim --nohighlight scripts/ball_orbit.xml

Workload runtimes are 5.9s, 14.7s, 10.9s, 11.3s, 24.8s, 8.0s, 9.8s, and 8.4s, totaling 93.9s, reftime 716s, corresponding to 7.63 points. -O3 -march=native only gives 2% improvement; -O3 -ljemalloc provides 4%; -O3 -flto gives 11%. LLVM 22 is slower than GCC 14.

1. weather¶

Hotspot functions:

__sincos_fma from libm: 6.75%;
__ieee754_atan2_fma from libm: 6.41%;
__strncmp_avx2 from libc: 5.04%;
parse_path from src/JSB-FlightSim/src/simgear/props/props.cxx: 4.43%, path string parsing, splitting into components;
__ieee754_pow_fma from libm: 4.05%.

The hotspots are quite unusual: mostly libm/libc functions, and flightdm's own most time-consuming function is a path parser. Various optimization flags having no effect is unsurprising.

2. B747¶

Hotspot functions:

SGPropertyNode::getDoubleValue from src/JSB-FlightSim/src/simgear/props/props.cxx: 5.65%, appears to be parsing configuration files and extracting floating-point values;
__ieee754_atan2_fma from libm: 5.42%;
__sincos_fma from libm: 5.25%.

Nothing interesting to analyze.

3. x153 and 4. c3104¶

Same hotspot functions as 2. B747.

5. ah1s¶

Hotspot functions:

SGPropertyNode::getDoubleValue from src/JSB-FlightSim/src/simgear/props/props.cxx: 8.45%, see above;
JSBSim::aFunc::getValue from src/JSB-FlightSim/src/math/FGFunction.cpp: 7.20%, a memoized std::function-like container;
__sincos_fma from libm: 6.04%;
__ieee754_atan2_fma from libm: 5.35%;
JSBSim::FGPropertyValue::getValue from src/JSB-FlightSim/src/math/FGPropertyValue.cpp: 5.11%, calls getDoubleValue above.

The overall impression: either calling libm for transcendental functions or extracting configuration file contents.

6. orbit_torque¶

Hotspot functions:

__ieee754_atan2_fma from libm: 7.52%;
__sincos_fma from libm: 6.82%;
__strncmp_avx2 from libc: 6.57%;
parse_path from src/JSB-FlightSim/src/simgear/props/props.cxx: 6.12%, path string parsing, splitting into components;
SGPropertyNode::getChild from src/JSB-FlightSim/src/simgear/props/props.cxx: 4.05%, traverses child nodes via string comparison to find matching children.

7. orbit_torque2 and 8. orbit¶

Same hotspot functions as 6. orbit_torque.

Summary¶

748.flightdm_r is an uninteresting benchmark. Much time is spent in libm and libc functions, while its own code just traverses configuration files. I'd call it a libm benchmark. Beyond that, it behaves more like a SPEC INT 2026 Rate workload: string operations, memory allocation, many small functions and lambdas, suitable for -O3 -flto optimization. Per-workload data under -O3:

Workload	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	Mispred (M)	MPKI
1. weather	5.9	106.1	30.8	15.4	19.5	12.9	0.6	11.6	0.11
2. B747	14.8	260.1	80.0	38.7	49.4	28.4	1.7	25.6	0.10
3. x153	10.8	193.3	59.1	28.7	37.3	20.0	1.0	20.9	0.11
4. c3104	11.4	194.6	58.9	29.1	35.7	23.9	1.3	18.2	0.09
5. ah1s	24.7	407.3	130.0	61.3	77.9	46.4	1.6	49.3	0.12
6. orbit_torque	7.9	152.8	41.9	22.7	28.3	16.3	1.1	24.2	0.16
7. orbit_torque2	9.9	191.4	52.5	28.4	35.3	21.0	1.2	17.1	0.09
8. orbit	8.4	161.6	44.3	23.9	30.0	17.2	1.0	16.3	0.10

Unremarkable.

749.fotonik3d_r¶

Finally, a familiar face from SPEC FP 2017 Rate (previously 549.fotonik3d_r). fotonik3d solves Maxwell's equations in 3D space. Another physics-based benchmark; 3D PDE solvers invariably involve Stencil, and let's see if this holds. Single workload:

fotonik3d_r

reftime is 1156s. Performance under different flags:

Compiler + Flags	Time (s)	Score	Improvement over GCC 14 `-O3` (%)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)
GCC 14 `-O3`	131.1	8.82	0	1408.5	375.1	120.7	30.9	5.4	527.2
GCC 14 `-O3 -march=native`	114.9	10.1	14	670.1	274.1	82.4	27.1	5.5	249.4
GCC 14 `-O3 -ffast-math`	116.7	9.91	12	1117.6	378.4	120.8	30.7	4.8	396.2
GCC 14 `-O3 -ffast-math -march=native`	108.5	10.65	21	599.5	276.3	82.3	26.9	4.8	204.8

LLVM 22 performs similarly to GCC 14 and is omitted. Both -O3 -march=native and -O3 -ffast-math provide solid improvements. Hotspot analysis:

power_dft from src/power.F90: 30.92%, performs DFT (Discrete Fourier Transform), bottleneck is double-precision floating-point multiply-add in loops, compiled to SSE vector instructions by GCC 14;
UPML_updateE_simple from src/UPML.F90: 24.73%, 3D Stencil computation, SSE vector instructions;
UPML_updateH from src/UPML.F90: 23.26%, 3D Stencil computation, SSE vector instructions;
mat_updateE from src/material.F90: 11.04%, Stencil computation, SSE vector instructions;
updateH from src/update.F90: 9.78%, Stencil computation, SSE vector instructions.

Besides power_dft, most time is spent on Stencil computation. This time the Stencil pattern is purer since GCC can vectorize well with SSE. Based on earlier experience, such programs benefit greatly from -O3 -march=native, -O3 -ffast-math, and their combination.

With -march=native, wider AVX2 vectors bring higher parallelism, plus FMA instructions like vfmaddsub231pd.

With -O3 -ffast-math, the core computation in power_dft is essentially complex multiplied by real, then added to complex, as shown in this Fortran code:

subroutine update(Efreq1, Efreq2, expfuncE, Efield1, Efield2, n)  implicit none  integer, intent(in) :: n  complex(8), intent(inout) :: Efreq1(n), Efreq2(n)  complex(8), intent(in) :: expfuncE(n)  real(8), intent(in) :: Efield1, Efield2  integer :: i   do i = 1, n  Efreq1(i) = Efreq1(i) + expfuncE(i) * Efield1  Efreq2(i) = Efreq2(i) + expfuncE(i) * Efield2  end do end subroutine update

Under -O3, GCC 14 faithfully implements complex multiplication. However, Efield1 and Efield2 are real numbers, so the converted complex has zero imaginary part. With -O3 -ffast-math, this simplifies to directly multiplying the real part into expfuncE's real and imaginary components. With -O3 -ffast-math -march=native, both optimizations combine: the AVX2 FMA instruction vfmadd213pd replaces the vfmaddsub231pd needed under -O3 -march=native (which simultaneously adds and subtracts; the subtraction comes from the complex multiplication definition, but subtracts zero here since Efield1/Efield2's imaginary part is zero). See Godbolt.

In summary, 749.fotonik3d_r is a classic floating-point application with heavy Stencil and vector floating-point operations, high parallelism, amenable to vectorization, and benefits from -ffast-math computation order optimization.

765.roms_r¶

Another returnee from SPEC FP 2017 Rate (previously 554.roms_r), implementing ocean simulation. Unsurprisingly, it's Stencil again. Single workload:

roms_r < roms_benchmark2.in.x

reftime is 1575s. Performance:

Compiler + Flags	Time (s)	Score	Improvement over GCC 14 `-O3` (%)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)
GCC 14 `-O3`	169.8	9.28	0	2620.6	874.8	204.7	192.1	193.3	709.2
GCC 14 `-O3 -march=native`	149.5	10.5	14	1317.9	555.3	125.0	126.6	164.9	365.9
GCC 14 `-O3 -ffast-math`	162.8	9.67	4	2518.6	854.5	204.0	178.5	134.0	711.7
LLVM 22 `-O3`	165.6	9.51	3	2434.3	834.9	190.3	164.1	231.8	687.0
LLVM 22 `-O3 -march=native`	152.1	10.4	12	1423.4	551.4	131.2	140.1	259.8	350.0

Heavy floating-point computation with high vectorizability; -O3 -march=native improvement is expected.

Hotspot functions:

step2d_tile from src/step2d_LF_AM3.h: 20.37%, 2D Stencil computation, high vectorization;
pre_step3d from src/pre_step3d.F90: 10.43%, floating-point computation in loops, high vectorization;
lmd_skpp from src/lmd_skpp.F90: 8.91%, complex floating-point computation in loops, mainly scalar;
step3d_t_tile from src/step3d_t.F90: 7.04%, 3D Stencil computation, high vectorization;
rhs3d from src/rhs3d.F90: 6.04%, 2D Stencil computation, high vectorization;
t3dmix2 from src/t3dmix2_geo.h: 5.86%, 3D Stencil computation, high vectorization;
step3d_uv_tile from src/step3d_uv.F90: 5.85%, 3D Stencil computation, high vectorization;
_ZGVbN2v_exp_sse4 from libmvec: 4.66%, vectorized exp.

Typical Stencil computation with high vectorization. With -O3 -march=native, wider vectors plus FMA naturally bring solid improvements.

766.femflow_r¶

femflow is a fluid dynamics solver for Navier-Stokes equations. Single workload:

femflow_r refrate.prm

reftime is 1467s. Performance:

Compiler + Flags	Time (s)	Score	Improvement over GCC 14 `-O3` (%)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)
GCC 14 `-O3`	188.7	7.77	0	3862.4	1358.5	797.6	117.5	562.2	676.0
GCC 14 `-O3 -march=native`	95.1	15.4	98	1736.9	619.3	356.0	65.2	286.8	445.4
GCC 16 `-O3`	153.6	9.55	23	3178.6	1109.3	673.3	127.2	56.3	930.9
GCC 16 `-O3 -march=native`	83.5	17.57	126	1457.0	501.1	281.4	61.1	47.2	545.7
LLVM 22 `-O3`	124.7	11.8	51	2703.0	857.3	475.5	60.6	40.8	930.3
LLVM 22 `-O3 -march=native`	88.7	16.5	113	1392.9	495.7	269.4	42.9	41.8	471.1

LLVM 22 provides significant improvement over GCC 14, and -O3 -march=native brings even more dramatic gains. This is the second-highest -O3 -march=native improvement in SPEC FP 2026 Rate (first is 772.marian_r below). GCC 16 also improves notably over GCC 14, overtaking LLVM 22 with -O3 -march=native.

There are many hotspot functions, mostly single-digit percentage each, mainly computational operators:

Laplace::LaplaceOperator::local_apply_quadratic_geo from src/laplace_operator.h: 5.49%, heavy floating-point vector computation with high parallelism;
operator *(const dealii::VectorizedArray &, const dealii::VectorizedArray &) from src/dealii/include/deal.ll/base/vectorization.h: 5.36%, element-wise vector multiplication.

Other functions include dealii::Tensor computations, including dealii::internal::even_odd_apply from src/dealii/include/deal.ll/matrix_free/tensor_product_kernels.h, implementing Tensor double-precision floating-point multiplication. The "even-odd" refers to exploiting data symmetry by splitting into even and odd parts, reducing computation count while being vectorization-friendly. For such workloads, -O3 -march=native provides better floating-point performance through wider vectors plus FMA.

LLVM 22's advantage over GCC 14 comes from vectorizing more code: comparing instruction counts, LLVM 22 executes fewer FP scalar instructions and more FP vector instructions. GCC 16 shows a similar pattern, approaching LLVM 22's vectorization level.

767.nest_r¶

nest is a spiking neural network simulator. This benchmark has three workloads:

# 1. cuba nest_r cuba_stdp.sli # 2. structural nest_r structural_plasticity_benchmark # 3. Artificial nest_r ArtificialSynchrony

-O3 -march=native gives only 3% improvement; LLVM 22 is slower than GCC 14. Per-workload data under GCC 14 -O3:

Workload	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)
1. cuba	14.1	176.3	54.5	21.6	22.4	29.2
2. structural	24.6	413.3	136.3	42.8	52.5	93.2
3. Artificial	48.6	1125.4	392.6	150.5	160.5	163.6

Total time 87.4s, reftime 793s, corresponding to 9.07 points.

1. cuba¶

Hotspot functions:

nest::iaf_psc_exp::handle from src/nest-simulator/models/iaf_psc_exp.cpp: 25.75%, processes incoming spikes and updates internal state. Main bottleneck is indirect memory access, writing spike weights to corresponding input buffers;
__ieee754_pow_fma from libm: 11.96%, called by nest::Connector::send below;
spec::poisson_distribution::operator() from src/specrand-distributions/spec_random_distributions.cpp: 9.87%, random number generation for input spike generation;
nest::Connector::send from src/nest-simulator/nestkernel/connector_base.h: 8.29%, spike propagation through synapses with STDP. Main bottleneck is indirect memory access, plus inlined weight computation with pow and exp calls;
nest::iaf_psc_exp::update from src/nest-simulator/models/iaf_psc_exp.cpp: 6.91%, neuron state update at each timestep, mainly scalar floating-point.

A classic SNN simulation with STDP. Main bottlenecks are spike propagation and STDP synaptic weight updates, with very low vectorization and indirect memory access.

2. structural¶

Hotspot functions:

spec::poisson_distribution::operator() from src/specrand-distributions/spec_random_distributions.cpp: 24.26%, see above;
nest::iaf_psc_alpha::update from src/nest-simulator/models/iaf_psc_alpha.cpp: 13.71%, similar to nest::iaf_psc_exp::update but different neuron model;
__ieee754_pow_fma from libm: 13.37%, see above;
nest::GrowthCurveGaussian::update from src/nest-simulator/nestkernel/growth_curve.cpp: 6.60%, numerical ODE solving with frequent exp and pow calls;
nest::iaf_psc_alpha::handle from src/nest-simulator/models/iaf_psc_alpha.cpp: 25.75%, similar to nest::iaf_psc_exp::handle;
nest::Connector::send from src/nest-simulator/nestkernel/connector_base.h: 6.60%, see above, but without STDP this time (static weights);
exp from libm: 5.39%.

Compared to 1. cuba, different neuron model without STDP. The main bottleneck shifts to Poisson distribution random generation; the rest is typical SNN simulation.

3. Artificial¶

Hotspot functions:

nest::iaf_psc_alpha_ps::update from src/nest-simulator/models/iaf_psc_alpha_ps.cpp: 13.26%, neuron state update;
nest::iaf_psc_alpha::update from src/iaf_psc_alpha.cpp: 12.37%, see above;
nest::Connector::send from src/nest-simulator/nestkernel/connector_base.h: 7.19%, see above, still no STDP (static weights);
nest::SimulationManager::update_ from src/nest-simulator/nestkernel/simulation_manager.cpp: 5.66%, core SNN simulation loop calling the above functions;
__ieee754_pow_fma from libm: 5.17%, see above.

Summary¶

nest is a flexible SNN simulator, but single-threaded performance is mediocre since most effort goes into multi-core/multi-thread optimization. Unsurprisingly, nest's neuron update code isn't vectorized, while spike propagation and STDP are inherently hard to optimize. This is a floating-point application that's difficult to vectorize; as the counters show, zero vector floating-point instructions are executed.

772.marian_r¶

marian_r is a neural-network-based translator. Another neural network inference workload, meaning -O3 -march=native should have a large advantage. If dedicated hardware acceleration instructions are available (like in 706.stockfish_r), performance will far exceed -O3. Two workloads:

# 1. TildeMODEL marian-decoder --cpu-threads 1 -m model.alphas.npz -v vocab.spm vocab.spm --beam-size 1 --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src -w 512 --skip-cost --gemm-type intgemm8 --intgemm-options precomputed-alpha standard-only --quiet --quiet-translation -i TildeMODEL-spec.en --log TildeMODEL-spec.log --log-level off -o TildeMODEL-spec.out # 2. EuroPat marian-decoder --cpu-threads 1 -m model.alphas.npz -v vocab.spm vocab.spm --beam-size 1 --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src -w 512 --skip-cost --gemm-type intgemm8 --intgemm-options precomputed-alpha standard-only --quiet --quiet-translation -i EuroPat-spec.en --log EuroPat-spec.log --log-level off -o EuroPat-spec.out

reftime is 1579s. Compiler and flag comparison:

Compiler + Flags	Time (s)	Score	Improvement over GCC 14 `-O3` (%)	1. TildeMODEL (s)	2. EuroPat (s)
GCC 14 `-O3`	235.2	6.71	0	88.8	146.4
GCC 14 `-O3 -march=native`	78.4	20.14	200	28.2	50.3
GCC 15 `-O3`	150.1	10.52	57	56.0	94.8
GCC 15 `-O3 -march=native`	77.5	20.37	203	27.8	49.7

-O3 -march=native provides a massive 200% improvement. On Apple M1 it's 47%, on Apple M2 it reaches 92%. This level of improvement was previously only seen in 706.stockfish_r. GCC 15 also significantly improves over GCC 14 under -O3.

1. TildeMODEL¶

Hotspot functions:

marian::cpu::integer::affineOrDotTyped from src/marian/tensors/cpu/intgemm_interface.h: 82.28%, mainly in tiled_gemm, performing integer matrix multiplication: uint8_t matrix A multiplied by int8_t matrix B, accumulated to int32_t, finally converted to float and added to float matrix C;
marian::cpu::ProdBatched from src/marian/tensors/cpu/prod.cpp: 10.30%, core is sgemm (actual floating-point matrix operations), compiled to scalar SSE floating-point rather than vector, but given its time share, this is tolerable.

The main hotspot has the same computation pattern as 706.stockfish_r's NNUE. With -O3 -march=native, AVX-VNNI's vpdpbusd instruction optimizes it (see Godbolt). Similarly, GCC 15 performs better than GCC 14 due to its superior unsigned extension implementation. For detailed discussion, see the 706.stockfish_r section in the INT Rate article.

Performance counter comparison:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	128-bit Int Vec (B)	256-bit Int Vec (B)
GCC 14 `-O3`	88.2	2038.9	217.8	57.8	53.2	58.7	2.1	514.6	0.0
GCC 14 `-O3 -march=native`	27.6	423.0	131.5	25.1	47.4	59.8	1.1	12.8	47.4
GCC 15 `-O3`	55.6	1353.5	173.9	22.1	53.2	58.7	2.1	184.7	0.0
GCC 15 `-O3 -march=native`	27.3	415.1	128.9	23.5	47.5	59.8	1.1	12.8	47.4

128-bit integer vector from int_vec_retired.128bit counter, 256-bit from int_vec_retired.256bit.

2. EuroPat¶

Hotspot functions:

marian::cpu::integer::affineOrDotTyped: 78.96%, see above;
marian::cpu::ProdBatched: 14.25%, see above.

Identical hotspots to 1. TildeMODEL; the same analysis applies. Performance counters:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)	FP Vector (B)	128-bit Int Vec (B)	256-bit Int Vec (B)
GCC 14 `-O3`	145.6	3352.7	370.4	89.7	98.8	123.8	3.6	815.0	0.0
GCC 14 `-O3 -march=native`	49.7	777.2	228.7	36.6	88.3	123.9	1.7	19.9	72.6
GCC 15 `-O3`	94.2	2268.5	301.7	33.1	98.8	123.8	3.6	293.6	0.0
GCC 15 `-O3 -march=native`	49.0	765.3	225.2	34.3	88.3	123.9	1.7	19.9	72.6

Summary¶

772.marian_r is essentially a 706.stockfish_r NNUE clone. The hotspot is int8_t times uint8_t accumulated to int32_t matrix multiplication, with more integer vector instructions than floating-point. It probably should be expelled from SPEC FP 2026 Rate.

782.lbm_r¶

lbm stands for Lattice Boltzmann Method, another fluid dynamics application, still Stencil. Single workload:

lbm_r 900 reference.dat 0 0 200_200_130_ldc.of

reftime is 573s. Performance comparison:

Compiler + Flags	Time (s)	Score	Improvement over GCC 14 `-O3` (%)	Insns (B)	Load (B)	Store (B)	Branch (B)	FP Scalar (B)
GCC 14 `-O3`	105.8	5.42	0	2232.2	473.3	242.4	14.5	1108.2
GCC 14 `-O3 -ffast-math`	95.8	5.98	10	1892.4	419.2	192.8	14.5	1009.5
GCC 14 `-O3 -march=native`	131.0	4.37	-19	1669.6	550.3	309.8	14.5	1228.8
GCC 15 `-O3`	105.2	5.45	0.6	2218.9	468.9	242.4	14.5	1108.2
GCC 15 `-O3 -march=native`	111.0	5.16	-5	1777.3	509.8	282.9	14.5	1108.2
GCC 16 `-O3`	105.4	5.44	0.4	2218.9	468.9	242.4	14.5	1108.2
GCC 16 `-O3 -march=native`	110.6	5.18	-4	1777.3	509.8	282.9	14.5	1108.2

The sole hotspot function is LBM_performStreamCollideTRT from src/lbm.c, accounting for 99.35% of time. Its structure is: read from current-round Grid, heavy floating-point computation, write to next-round Grid, with conditional branches in between. Memory access is strided, making vectorization difficult; all generated instructions are SSE scalar. For such scalar-compute-intensive cases, -O3 -ffast-math typically helps by reordering computations and reusing intermediate results.

-O3 -march=native actually regresses performance. GCC 14 regresses worst (-19%); GCC 15/16 regress less but still underperform -O3. Assembly analysis suggests increased stack memory access instructions offset the FMA instruction count reduction benefit (see Godbolt). Note that FMA instructions are counted twice in the FP scalar column but only once in total instruction count.

Discussion¶

Compiler Flags Comparison¶

Overall, compiler flags have significant impact on SPEC FP 2026 Rate performance:

-march=native provides solid improvement for many benchmarks. AVX2 not only widens vectors compared to SSE but also adds many useful instructions that reduce instruction count, plus AVX-VNNI specifically benefits 772.marian_r;
-ffast-math also helps notably, especially since SPEC FP 2026 Rate has substantial floating-point computation. Strictly following source code computation order is often slower than optimized ordering. However, -ffast-math may produce results not conforming to IEEE 754;
-flto and -ljemalloc have minimal effect on most SPEC FP 2026 Rate benchmarks, though they slightly help 748.flightdm_r.

Other common flags like -static and -fomit-frame-pointer haven't been extensively tested yet.

Branch Prediction¶

Only 731.astcenc_r and 737.gmsh_r have notably high MPKI in SPEC FP 2026 Rate; others peak at 767.nest_r's 0.87. 731.astcenc_r's high MPKI is entirely due to GCC 14's poor compilation. Switching to LLVM 22 immediately normalizes it. Hopefully GCC will address this.

Conclusion¶

This article provides in-depth analysis of SPEC CPU 2026 FP Rate workloads, for reference by compiler and processor designers. From a compiler perspective, combining the strengths of both GCC and LLVM can further improve performance. From a processor perspective, optimizing for program bottlenecks can further increase scores.

SPEC CPU 2026 负载特性分析（FP Rate 篇）

Fri, 29 May 2026 00:00:00 +0000

SPEC CPU 2026 负载特性分析（FP Rate 篇）¶

English version

背景¶

继 INT Rate 篇后，本文继续分析 SPEC FP 2026 Rate 的负载特性。

测试环境与先前的 INT Rate 篇相同，这里不再赘述。

SPEC FP 2026 Rate 分析¶

709.cactus_r¶

Cactus 是一个计算框架，这里用它来求解真空中的爱因斯坦方程。命令参数如下：

cactus ShiftedGaugeWave.par

实测数据显示，运行时间为 103.4s，reftime 是 858s，对应 8.30 分。不同编译器和编译选项对 709.cactus_r 的优化情况如下：

编译器 + 选项	时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)
GCC 14 `-O3`	103.4	8.30	0
GCC 14 `-O3 -march=native`	83.9	10.23	23
GCC 14 `-O3 -ffast-math`	101.2	8.48	2
GCC 14 `-O3 -ljemalloc`	100.7	8.52	3
LLVM 22 `-O3`	94.6	9.07	9
LLVM 22 `-O3 -march=native`	90.5	9.48	14

可见 -march=native 能提供巨大的性能提升，LLVM 22 在 -O3 下比 GCC 14 快，不过 GCC 14 的 -O3 -march=native 又反超了 LLVM 22 的 -O3 -march=native，后面会具体分析。

通过 perf 观察性能瓶颈：

ML_CCZ4::ML_CCZ4_EvolutionInteriorSplitBy2_Body 来自 src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_EvolutionInteriorSplitBy2.cc：占总时间 41.30%，下同；
ML_CCZ4::ML_CCZ4_EvolutionInteriorSplitBy3_Body 来自 src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_EvolutionInteriorSplitBy3.cc：31.26%；
ML_CCZ4::ML_CCZ4_ConstraintsInterior_Body 来自 src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_ConstraintsInterior_Body.cc：6.71%；
ML_CCZ4::ML_CCZ4_EvolutionInteriorSplitBy1_Body 来自 src/repos/mclachlan/ML_CCZ4/src/ML_CCZ4_EvolutionInteriorSplitBy3.cc：6.44%。

这些热点函数的代码模式都是类似的：在三层循环里，读取对应三维空间中的点的数据，进行一系列的 Stencil 访存和浮点运算，包括大量的浮点乘法加法减法、pow 和 fabs，最后把结果写入对应数组。从指令来看，就是用大量的 SSE 指令来进行标量的双精度浮点运算，没有进行向量化。实验的时候，还观察到了编译器对 pow 和 fabs 的优化。在 -O3 时，pow(a, 1) 被编译成 a，pow(a, 2) 被编译成 a * a，pow(a, -1) 被编译成 1.0 / a，不过其他的例如 pow(a, 3) 和 pow(a, -2) 就只能转为 libm 的 pow 实现了。如果开了 -O3 -ffast-math，那么 pow(a, 3) 会编译成 a * a * a，pow(a, -2) 会被编译为 1.0 / (a * a)。两种编译选项的对比见 Godbolt。代码中，出现的主要就是 pow(a, -1)，pow(a, 2)、pow(a, -2) 和 pow(a, runtimeVariable)，其中 runtimeVariable 指一个在运行时才知道的数，在代码中对应 shiftAlphaPower 或 harmonicN。fabs 被编译成了位运算 andpd 指令，直接把符号位置零。

开启 -O3 -march=native 后，其实依然没有向量化，用 AVX2 指令计算双精度标量浮点，依然能看到对 libm 的 pow 的调用，就是上面提到的 pow(a, -2) 或 pow(a, runtimeVariable)，不过其余的计算部分因为能用 vfmadd132sd/vfnmadd132sd 而获得了性能提升，同时 vaddsd 相比 addsd 从两操作数变为三操作数，还允许访存，进一步节省了指令数。而在 ARM64 平台上，开 -march=native 就没有性能提升，这是因为它的浮点乘加融合指令即使在没开 -march=native 的情况下也是可以使用的，见 Godbolt。某种意义上来说，AMD64 上开 -march=native 有性能巨大提升，也是吃了先发劣势的亏：基线对应的处理器太早，缺少很多重要的指令集扩展，这种兼容性负担在很多其他指令集上不会出现，例如乘加融合 FMA 指令很多指令集里已经在基线当中，在这些指令集上，开 -march=native 的提升就会相对来说更低。所以现在很多软件会曲线救国，为了保证兼容性，针对多个不同指令集扩展分别做手动适配，在运行时根据可用性选择性能最好的那一个。如果编译器能很好地自动完成这一点，将会在保持兼容性和开发便捷性的前提下，带来不错的系统整体性能提升。

不同编译选项的情况对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)
GCC 14 `-O3`	103.4	1423.6	747.8	110.1	9.8	677.0	5.2
GCC 14 `-O3 -march=native`	83.9	988.5	711.9	89.5	8.9	686.1	2.6
GCC 14 `-O3 -ffast-math`	101.8	1387.7	742.2	103.4	5.3	641.0	5.6
GCC 14 `-O3 -ljemalloc`	100.7	1423.6	747.8	110.1	9.8	677.0	5.2
LLVM 22 `-O3`	94.6	1323.1	659.1	96.6	6.1	659.0	15.2
LLVM 22 `-O3 -march=native`	90.5	1054.5	690.7	119.4	5.4	681.4	5.4

其中总指令数来自 instructions，Load 指令数来自 mem_inst_retired.all_loads，Store 指令数来自 mem_inst_retired.all_stores，分支指令数来自 branch-instructions，浮点标量指令数用 fp_arith_inst_retired.scalar，浮点向量指令数用 fp_arith_inst_retired.vector 性能计数器，下同。需要注意的是，vfmadd132sd 等乘加融合指令在 fp_arith_inst_retired.scalar/vector 计数器中会被计算两次。

从表里可以看出，-O3 下基本是一半指令在 Load，另一半指令在做浮点标量运算，这个计算访存比还是挺低的，这是 Stencil 计算的典型特征，在网格邻域里，Load 一个值进来，做一次乘加。开 -O3 -march=native 后，因为乘加融合指令的加持，指令数减少了很多，但因为乘加融合会算两倍的贡献，并且那些同时进行访存和计算的 AVX2 指令也会被同时计入到 Load 和浮点指令数，估计微架构是统计的拆分后的微码数量，那么总指令数不再等于各类指令数求和。这里 -O3 -ljemalloc 带来了些许的性能优势，不过指令数上并没有体现，它的性能提升主要是来自缓存局部性的改进。GCC 14 和 LLVM 22 在不同编译选项下各有千秋，大概看了一下生成的指令，其实实现方法都差不多，主要是地址计算、栈的使用和寄存器分配有一些区别。

值得注意的是，709.cactus_r 的缓存缺失率较高：GCC 14 -O3 下，L1 ICache 的 MPKI 达到 118.6B/1423.6B*1000=83.30，L1 DCache 也有 125.6B/1423.6B*1000=88.23 的 MPKI，在 SPEC FP 2026 Rate 和 SPEC INT 2026 Rate 中都是最高的。因此 L1 ICache 更大的核心更占优势，32KB 时遇到的 L1 ICache 瓶颈，换成 64KB 可能就消失了。开 -O3 -ljemalloc 后，L1 DCache 的 MPKI 降低到 111.7B/1423.6B*1000=78.46，在指令数与 -O3 持平的情况下获得了约 3% 的性能提升。

722.palm_r¶

palm 是一个天气预报相关的程序，做的是 Navier Stokes 方程的求解，命令如下：

palm_r < runfile_atmos

实测数据显示，运行时间为 174.0s，reftime 是 1320s，对应 7.59 分。不同编译器和编译选项对 722.palm_r 的优化情况：

编译器 + 选项	时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)
GCC 14 `-O3`	174.0	7.59	0
GCC 14 `-O3 -march=native`	157.8	8.34	10
GCC 14 `-O3 -ffast-math`	168.4	7.84	3
GCC 14 `-O3 -ljemalloc`	172.4	7.66	1
LLVM 22 `-O3`	144.0	9.17	21
LLVM 22 `-O3 -march=native`	118.6	11.13	47

趋势和 709.cactus_r 类似，-O3 -march=native 对性能提升巨大，LLVM 22 也明显比 GCC 14 快。

热点函数：

advec_s_ws_ij 来自 src/advec_ws.F90：9.80%，经典的 3 维上的 Stencil 计算，访存和计算的比例接近，基本是 load 一个点的数值然后就做对应的乘加，用 SSE 指令来做计算，有部分向量化计算，例如 addpd/subpd/mulpd 等，每条指令处理 2 个双精度浮点元素，不过也有一些循环没能成功向量化，退化到 addsd/subsd/mulsd 等浮点标量指令；
advec_u_ws_ij 来自 src/advec_ws.F90：8.80%，同上；
advec_v_ws_ij 来自 src/advec_ws.F90：8.54%，同上；
advec_w_ws_ij 来自 src/advec_ws.F90：8.24%，同上；
diffusion_e_ij 来自 src/turbulence_closure_mod.F90：5.14%，有一些比较复杂的浮点运算，比如 min/sqrt/div 等等，还有位运算，用 MERGE 来进行 ternary operator，无向量化，用 SSE 指令来做标量浮点计算。

以下是 advec_s_ws_ij 中的 Stencil 计算代码，按 i,j,k 的顺序进行三层循环：

flux_r(k) = u_comp * ( &  37.0_wp * ( sk(k,j,i+1) + sk(k,j,i) ) &  - 8.0_wp * ( sk(k,j,i+2) + sk(k,j,i-1) ) &  + ( sk(k,j,i+3) + sk(k,j,i-2) ) ) * adv_sca_5

不同编译选项的情况对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)
GCC 14 `-O3`	174.0	3416.6	1267.4	271.1	155.6	779.0	318.5
GCC 14 `-O3 -march=native`	157.8	2710.0	1212.8	242.5	147.1	785.9	172.6
GCC 14 `-O3 -ffast-math`	168.4	3373.5	1204.7	278.0	134.0	612.8	363.1
GCC 14 `-O3 -ljemalloc`	172.4	3368.4	1259.7	260.7	141.6	779.0	318.5
LLVM 22 `-O3`	144.0	2640.4	835.5	216.3	90.4	179.5	609.7
LLVM 22 `-O3 -march=native`	118.6	1643.8	586.5	165.6	67.6	180.8	306.7

开 -O3 -march=native 后，能看到大量的 AVX2 向量化指令：vmulpd/vdivsd/vaddpd/vsubpd/vfmadd213sd/vfmsub132pd/vfmsub231pd/vmovupd 等等，每次处理 4 个双精度浮点元素，向量化程度很高，如果放在支持 AVX512 的处理器上，性能可能还会更高。相比 709.cactus_r 被 pow 等问题限制没能向量化，722.palm_r 的向量化收益要明显得多。LLVM 22 在 -O3 下比 GCC 14 更好，是因为它在热点函数如 advec_u/v/w_ws_ij 中成功进行了向量化，而 GCC 14 仍用标量，体现在数据上就是浮点向量指令数明显增多，浮点标量指令数明显减少。LLVM 22 下，上述热点函数被优化得较好后，flow_statistics（来自 src/flow_statistics.F90，时间占比 5.79%）成为了新的热点函数。它能向量化的部分比较少，因而时间占比提升。即使开了 -O3 -march=native，也还是用 AVX2+FMA 指令来做标量计算，时间区别不大。其他部分时间降低后，它的时间占比进一步提高到 6.95%，类似 Amdahl 定律。

709.cactus_r 和 722.palm_r 的计算模式其实都是 Stencil。物理相关的模拟经常做这类事情：在三维空间里求解微分方程，数值求解时需要对每个点的邻域进行反复计算，落到最后就是 Stencil。

731.astcenc_r¶

astcenc 是一个针对 ASTC 有损压缩图片格式的编码器，运行三次，命令如下：

# 1. linear astcenc_r ref-inputs-linear.txt # 2. hdr astcenc_r ref-inputs-hdr.txt # 3. precision astcenc_r ref-inputs-precision.txt

实测运行时间为 49.9s、72.1s 和 53.8s，总时间 175.8s，reftime 是 840s，对应 4.78 分。不同编译器和编译选项的优化情况如下：

编译器 + 选项	总时间 (s)	1. linear 时间 (s)	2. hdr 时间 (s)	3. precision 时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)
GCC 14 `-O3`	175.8	49.9	72.1	53.8	4.78	0
GCC 14 `-O3 -march=native`	157.3	44.0	63.2	50.0	5.34	12
GCC 14 `-O3 -ffast-math`	160.5	44.6	67.2	48.7	5.23	10
LLVM 22 `-O3`	134.0	38.5	56.1	39.3	6.27	31
LLVM 22 `-O3 -march=native`	117.2	34.4	48.6	34.1	7.17	50

又是 LLVM 22 相比 GCC 14 有明显优势的一个基准测试。其他对性能几乎没有影响的优化选项包括 -flto 和 -ljemalloc，这里就不具体列举了。731.astcenc_r 是 SPEC FP 2026 Rate 中 MPKI 最高的那一个，高达 5.0，相比其他大多数不到 1.0 的 MPKI 来说很高（第二高的是 737.gmsh_r，MPKI 达到了 3.33，第三高 767.nest_r 的 MPKI 只有 0.83），也比 SPEC INT 2026 Rate 的不少基准测试更高。下面分负载来进行分析。

1. linear¶

主要热点函数：

compute_angular_endpoints_for_quant_levels 来自 src/astcenc_weight_align.cpp：18.93%，主要瓶颈是在中间的循环，在用 SSE 做一些单精度浮点的标量计算，中间还有一些对来自 libm 的 nearbyint 调用，进行 round 操作，从代码来看，开发者有意识地写一些适合编译器去向量化的代码，比如用 vfloat4 类型来做一些批量操作，还有 vmask4 类型保存 vfloat4 比较的结果（vmask4 保存了四个 int，用 0 代表 false，用 -1 代表 true），再用 select 函数来进行向量化的 ternary operator，可惜编译器并不领情，编译出来依然是标量 SSE；
compute_avgs_and_dirs_3_comp_rgb 来自 src/astcenc_averages_and_directions.cpp：14.70%，模式和上面类似，在循环中做一些 vfloat4 和 vmask4 的计算，但 SSE 指令都是标量的；
compute_quantized_weights_for_decimation 来自 src/astcenc_ideal_endpoints_and_weights.cpp：13.34%，在循环中做一些不过因为涉及到量化，有一些 vint 参与以及查表 vtable_lookup_32bit，这里 vfloat/vint 本来代表的是根据平台能提供的 SIMD 宽度进行一个自动的映射（定义在 src/astcenc_vecmathlib.h 中，比如 AVX 就是 8 个元素，vfloat 映射到 vfloat8；SSE 就是 4 个元素，vfloat 映射到 vfloat4），不过显然这些在 SPEC 里都被禁用了，fallback 到了 4 个元素的情况；
compute_ideal_weights_for_decimation 来自 src/astcenc_ideal_endpoints_and_weights.cpp：9.57%，主要瓶颈是在一个 gather 操作 gatherf_byte_inds 里，不过因为 SSE 不支持 gather，所以是拆成四个元素分别进行 load 和标量计算的；
bilinear_infill_vla 来自 src/astcenc_ideal_endpoints_and_weights.cpp：7.80%，瓶颈一样是 gather，即 gatherf_byte_inds 函数；
compute_error_squared_rgb 来自 src/astcenc_averages_and_directions.cpp：6.39%，瓶颈一样是 gather，以及 gather 之后的一系列向量计算，但 GCC 14 都编译成了 SSE 标量计算。

原生 SIMD 写法编译出来却是标量指令，反过来也说明，如果能正确向量化，性能还会有明显的提升空间。进一步，如果开了 -O3 -march=native，向量更宽来到 256 位，还多了 vblendvps 指令来实现上述 select 函数。前面提到过，LLVM 22 明显更快，下面看看不同编译器和编译选项的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)	MPKI
GCC 14 `-O3`	49.9	835.7	259.3	55.6	63.2	188.6	28.6	3136.0	3.75
GCC 14 `-O3 -march=native`	44.0	652.4	234.0	46.3	52.9	184.6	28.5	3148.2	4.83
GCC 14 `-O3 -ffast-math`	44.6	780.5	259.8	54.6	49.3	159.9	43.2	2139.0	2.74
LLVM 22 `-O3`	38.5	829.7	235.0	34.8	36.1	68.8	155.6	1095.5	1.32
LLVM 22 `-O3 -march=native`	34.4	620.9	179.5	17.7	19.6	42.1	125.7	823.4	1.33

从计数器可以看到，GCC 14 整体性能比 LLVM 22 差，是因为 LLVM 22 做了更多的向量化，它浮点向量指令明显比浮点标量要多，并且错误预测明显更少，MPKI 小很多。下面进行深入的分析。

首先看 GCC 14 是怎么实现 731.astcenc_r 的这类 SIMD 原生代码的。以上面分析的热点函数为例，一个常见的模式是用 vfloat4 的比较加 select 来实现向量化的最大值计算：

vfloat4 vmax(vfloat4 a, vfloat4 b) {  vmask4 mask = b > a;  return select(a, b, mask); }

这段代码在 -O3 编译选项下会被 GCC 14 编译成这样的汇编：

vmax(vfloat4 a, vfloat4 b):  # a 向量保存在 xmm0（a[0] 和 a[1]）和 xmm1（a[2] 和 a[3]）寄存器  # b 向量保存在 xmm2（b[0] 和 b[1]）和 xmm3（b[2] 和 b[3]）寄存器  # 虽然每个元素都是单精度，但每个 xmm 寄存器只保存了两个元素  movq %xmm1, %rax # rax = a3 | a2  movq %xmm3, %rcx # rcx = b3 | b2  movq %xmm0, %rsi # rsi = a1 | a0  movd %ecx, %xmm1 # xmm1 = b2  movd %eax, %xmm6 # xmm6 = a2  shrq $32, %rcx # rcx = b3  movdqa %xmm2, %xmm5 # xmm5 = b1 | b0  shrq $32, %rax # rax = a3  movdqa %xmm2, %xmm0 # xmm0 = b1 | b0  movd %ecx, %xmm4 # xmm4 = b3  shufps $85, %xmm5, %xmm5 # xmm5 = b1 | b1 | b1 | b1  movd %eax, %xmm2 # xmm2 = a3  movd %esi, %xmm7 # xmm7 = a0  shrq $32, %rsi # rsi = a1  movdqa %xmm5, %xmm3 # xmm3 = b1 | b1 | b1 | b1  comiss %xmm2, %xmm4 # 比较 a3 和 b3  movd %esi, %xmm5 # xmm5 = a1  seta %al # al = (b3 > a3)  comiss %xmm6, %xmm1 # 比较 b2 和 a2  jbe .L14 # 如果 a2 >= b2 就跳转到 .L14  testb %al, %al  jne .L15 # 如果 b3 > a3 就跳转到 .L15  # 此时 a2 < b2, a3 >= b3  maxss %xmm7, %xmm0 # xmm0 = max(a0, b0)  maxss %xmm5, %xmm3 # xmm3 = max(a1, b1)  unpcklps %xmm2, %xmm1 # xmm1 = a3 | b2  unpcklps %xmm3, %xmm0 # xmm0 = max(a1, b1) | max(a2, b2)  ret .L14: # 处理 a2 >= b2 的情况  testb %al, %al  jne .L16 # 如果 b3 > a3 就跳转到 .L16  #3 此时 a2 >= b2, a3 >= b3  movaps %xmm6, %xmm1 # xmm1 = a2  # 下略，就是分类讨论 a2 vs b2，a3 vs b3 的四种情况 .L17:  maxss %xmm7, %xmm0  maxss %xmm5, %xmm3  unpcklps %xmm2, %xmm1  unpcklps %xmm3, %xmm0  ret .L16:  movaps %xmm4, %xmm2  movaps %xmm6, %xmm1  jmp .L17 .L15:  maxss %xmm7, %xmm0  maxss %xmm5, %xmm3  movaps %xmm4, %xmm2  unpcklps %xmm2, %xmm1  unpcklps %xmm3, %xmm0  ret

很奇怪的是，它首先用通用寄存器把输入的数值拆分出来，然后分别比较后两个元素 a2 vs b2，a3 vs b3，用分支来处理四种可能的情况，这四种情况是已知后两个元素最大值都来自哪里，结果针对前两个元素又用 maxss 来计算，为啥不一开始就对所有四个元素都用 maxss 呢？结果开 -O3 -ffast-math 后，它莫名其妙就学会了这一点：

vmax(vfloat4, vfloat4):  movq %xmm0, %rsi  movq %xmm1, %rcx  movq %xmm2, %rdx  movd %esi, %xmm1  movq %xmm3, %rax  movdqa %xmm2, %xmm0  shrq $32, %rdx  maxss %xmm1, %xmm0  shrq $32, %rsi  movdqa %xmm3, %xmm1  shrq $32, %rax  movd %ecx, %xmm3  shrq $32, %rcx  movd %edx, %xmm2  movd %esi, %xmm4  maxss %xmm3, %xmm1  movd %ecx, %xmm5  movd %eax, %xmm3  maxss %xmm4, %xmm2  maxss %xmm5, %xmm3  unpcklps %xmm2, %xmm0  unpcklps %xmm3, %xmm1  ret

但依然是用 SSE 做标量，而 LLVM 22 就懂得如何用 maxps 指令向量化：

vmax(vfloat4, vfloat4):  movlhps %xmm3, %xmm2  movlhps %xmm1, %xmm0  maxps %xmm2, %xmm0  movaps %xmm0, %xmm1  unpckhpd %xmm0, %xmm1  retq

剩余的指令只是为了解决调用约定的数据存放位置问题，实际在函数内部计算的时候，通常就一条 maxps 指令完成所有 4 个元素的 max 计算。从这个例子也可以看出，为啥 LLVM 22 比 GCC 14 要快得多：GCC 14 多了很多无用的分支来解决 select 里的比较，而且还不能向量化 max 操作。即使给 GCC 14 开 -march=native，它依然还在用 AVX 指令进行标量 max 运算，真是难绷。上述编译结果可见 Godbolt。GCC 14 的 MPKI 那么高，其实都是这么来的，也挺搞笑。我还测试了一下，发现相同的代码在 LoongArch 下也没有得到很好的向量化支持（见 Godbolt），因此提了一个 issue，仅考虑向量化 fmax 内核，用 vfcmp.slt.s + vbitsel.v 的优化实现大概是目前 LLVM 22 编译结果的 2.9 倍性能。这里有一个小冷知识，就是 x86 的 SSE/AVX max 指令都实现的都是 a > b ? a : b 的逻辑，而 LoongArch 的 fmax 指令实现的是 IEEE754 的 maxNum，二者在出现 NaN 时的行为不同：前者只要 a 或 b 出现一个 NaN，就都返回 b；后者只有一个 NaN 时，会返回另一个非 NaN 的数。

2. hdr¶

主要热点函数：

compute_angular_endpoints_for_quant_levels 来自 src/astcenc_weight_align.cpp：19.80%，描述见上；
compute_avgs_and_dirs_3_comp_rgb 来自 src/astcenc_averages_and_directions.cpp：15.37%，描述见上；
compute_quantized_weights_for_decimation 来自 src/astcenc_ideal_endpoints_and_weights.cpp：12.40%，描述见上；
compute_error_squared_rgb 来自 src/astcenc_averages_and_directions.cpp：6.91%，描述见上；
compute_ideal_weights_for_decimation 来自 src/astcenc_ideal_endpoints_and_weights.cpp：5.68%，描述见上。

热点函数基本和 1. linear 一致，那么各方面基本也和它一样，GCC 14 生成大量分支和标量 SSE 指令，而 LLVM 22 能更好地向量化，避免一些无谓的分支。对比如下：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)	MPKI
GCC 14 `-O3`	72.1	1091.8	306.9	78.6	91.7	245.8	30.4	4928.9	4.51
GCC 14 `-O3 -march=native`	63.1	851.4	271.2	65.2	77.4	240.1	30.4	4890.6	5.74
GCC 14 `-O3 -ffast-math`	67.1	1036.6	311.0	85.5	73.7	200.8	54.3	4077.0	3.93
LLVM 22 `-O3`	55.9	1107.9	276.5	55.9	56.9	111.8	129.9	1943.2	1.75
LLVM 22 `-O3 -march=native`	48.6	825.2	209.3	30.7	34.1	85.2	139.7	1411.6	1.71

3. precision¶

热点函数大多还是和 1. linear 以及 2.hdr 一样，就是多了一个 find_best_partition_candidates 函数，来自 src/astcenc_find_best_partitioning.cpp，主要瓶颈在 a / sqrt(length) 的计算上。这次 GCC 14 在 -O3 时倒是能够正确向量化这一步，通过一次标量的 sqrtss 加 shufps 把结果复制到所有 lane，再用 divps 进行批量的除法，不过其余的热点函数还是一如既往的编译出很慢的代码。下面给出性能计数器上的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)	MPKI
GCC 14 `-O3`	53.8	711.5	176.8	62.0	61.3	177.0	9.3	5119.2	7.19
GCC 14 `-O3 -march=native`	49.2	570.5	161.3	57.1	54.7	176.1	9.2	5113.1	8.96
GCC 14 `-O3 -ffast-math`	48.7	655.9	168.3	64.6	49.8	156.5	19.5	4227.6	6.56
LLVM 22 `-O3`	39.3	729.9	149.2	42.8	35.9	75.3	77.2	1906.7	2.61
LLVM 22 `-O3 -march=native`	34.1	544.9	112.5	28.0	23.2	52.0	87.1	1445.7	2.65

小结¶

731.astcenc_r 用了 SIMD 原生的写法来编程：vfloat4、vint4 和 vmask4 等等，编写时就是奔着 SIMD 指令去的。只可惜 GCC 14 辜负了开发者的期望，不能正确识别代码意图并利用硬件指令，还莫名生成了一堆分支来实现 select 函数。相比之下，LLVM 22 就做得好很多，该向量化的地方就向量化。同时也能看到，像 LoongArch 这样稍微小众一些的指令集，在这些代码模式下的优化还比较欠缺，无论 GCC 还是 LLVM 都是如此。

736.ocio_r¶

ocio 是 OpenColorIO 的缩写，和 731.astcenc_r 类似，也是在图片上的处理，不过更侧重于图像处理，而非图像压缩。该基准测试包括如下四个负载：

# 1. lut1d ocioperf --spec-validation-offset 101 --spec-validation-stride 17 --spec-validation-pixels 131 --bitdepths ui16 ui16 --iter 100 --test -1 --transform ctf/lut1d_halfdom.ctf # 2. mntr ocioperf --spec-validation-offset 202 --spec-validation-stride 19 --spec-validation-pixels 132 --bitdepths ui16 f32 --iter 200 --8kres --test 0 --transform ctf/mntr_srgb_identity.ctf # 3. aces ocioperf --spec-validation-offset 303 --spec-validation-stride 23 --spec-validation-pixels 133 --bitdepths f32 f32 --iter 20 --8kres --test -1 --transform clf/aces_to_video_with_look.clf # 4. heavy ocioperf --spec-validation-offset 404 --spec-validation-stride 29 --spec-validation-pixels 134 --bitdepths f32 f32 --iter 25 --test -1 --transform clf/heavy_transform.clf

reftime 是 875s，不同编译器和编译选项的运行情况如下：

编译器 + 选项	总时间 (s)	1. lut1d 时间 (s)	2. mntr 时间 (s)	3. aces 时间 (s)	4. heavy 时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)
GCC 14 `-O3`	139.8	6.1	11.2	67.8	54.6	6.26	0
GCC 14 `-O3 -march=native`	105.0	4.2	10.2	49.6	40.1	8.33	33
GCC 14 `-O3 -ffast-math`	139.4	6.4	11.4	67.8	53.9	6.28	0.3
LLVM 22 `-O3`	128.9	6.8	11.3	61.7	49.0	6.79	8
LLVM 22 `-O3 -march=native`	105.3	5.4	9.6	49.3	40.9	8.31	33

可见又是一个 -O3 -march=native 带来明显提升的基准测试，且 LLVM 22 依然比 GCC 14 在 -O3 下有性能优势，在 -O3 -march=native 时基本打平。下面进行具体分析。

1. lut1d¶

热点函数：

OpenColorIO_v2_2dev::BitDepthCast<BIT_DEPTH_F32, BIT_DEPTH_UINT16>::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/CPUProcessor.cpp：45.16%，主要做的计算是，在循环中对取值在零到一之间的单精度浮点元素，乘以 65535 从而放缩到 uint16_t 的范围，加 0.5 后 clamp 到 uint16_t 的范围，最后再 float 转换为 uint16_t，这个过程被编译为 SSE 的向量指令；
OpenColorIO_v2_2dev::Lut1DRendererHalfCode<BIT_DEPTH_UINT16, BIT_DEPTH_F32>::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut1d/Lut1DOpCPU.cpp：33.70%，在循环中对输入的 uint16_t 进行查表，其实就是从预先计算好的数组里读取 uint16_t 对应的 float 值，瓶颈是 SSE 标量间接访存；
__memmove_avx_unaligned_erms 来自 libc：13.28%，memmove 的 AVX 加速实现；
__memset_avx2_unaligned_erms 来自 libc：3.55%，memset 的 AVX 加速实现。

对于这类可以高度向量化的代码，-O3 -march=native 的提升是很明显的，在 OpenColorIO_v2_2dev::BitDepthCast<BIT_DEPTH_F32, BIT_DEPTH_UINT16>::apply 函数里，体现就是用上了 AVX2 的 256 位向量计算以及 FMA 指令，正好把放缩和加 0.5 这两步融合在了一起，后续则是继续用位运算来实现 clamp 操作，使得这个函数在 -O3 -march=native 下的时间占比降低到了 27.82%，那么依然在用 SSE 标量进行间接访存的 OpenColorIO_v2_2dev::Lut1DRendererHalfCode<BIT_DEPTH_UINT16, BIT_DEPTH_F32>::apply 就成为了主要的性能瓶颈，时间占比提升到 42.85%。

在该基准测试里，GCC 14 比 LLVM 22 更快一些。以下是二者在不同编译选项下的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)
GCC 14 `-O3`	6.1	106.2	23.3	11.7	4.2	2.6	5.0	2.6
GCC 14 `-O3 -march=native`	4.2	63.8	22.0	11.0	3.6	2.6	2.5	2.5
GCC 14 `-O3 -ffast-math`	6.4	104.8	23.2	11.7	4.2	2.5	5.0	2.6
LLVM 22 `-O3`	6.8	106.1	23.3	11.7	3.6	2.5	5.0	2.6
LLVM 22 `-O3 -march=native`	5.4	72.5	24.8	11.0	1.4	2.5	2.5	2.5

具体到汇编层面上，可以观察到，GCC 14 和 LLVM 22 在实现上有一些不同，开头都是乘法和加法，主要是 clamp 的部分用的指令不同，为了解决 16 位和 32 位的位宽转换的问题，GCC 14 主要用 punpcklwd 类指令，而 LLVM 22 更多使用 pshufd 类指令，详见 Godbolt。虽然总指令数很接近，但毕竟硬件执行这些指令需要的时间不同，所以体现在 IPC 上也有一定的差距。开 -O3 -march=native 之后也是类似的情况。

2. mntr¶

热点函数：

OpenColorIO_v2_2dev::BitDepthCast<BIT_DEPTH_UINT16, BIT_DEPTH_F32>::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/CPUProcessor.cpp：55.41%，这次转换的方向反过来了，是从 uint16_t 到 float，于是计算过程变成先从 uint16_t 转成 float，再乘以 1.0/65535.0，当然这次就没有 clamp 了，编译器依然能正确向量化，不过因为位宽从 16 变成 32 的问题，花了不少功夫；
OpenColorIO_v2_2dev::ScaleRenderer::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/ops/matrix/MatrixOpCPU.cpp：41.52%，代码逻辑就是很简单的对每个像素的四个分量分别乘以一个 scale（从 out[0] = in[0] * m_scale[0] 到 out[3] = in[3] * m_scale[3]），不同像素的 scale 来自同一个数组 m_scale，理应是比较好向量化的，但实际上并没有向量化成功，这是因为指针没有标记 restrict，编译器无法判断 out 和 m_scale 是否可能重合，只有在不重合的前提下，才能直接用 mulps 向量化，见 Godbolt。

由于 AMD64 缺少对混合宽度计算的向量指令，其实很大开销是在向量之间搬运数据，而非进行实际的计算和访存，这方面，RISC-V Vector 的特殊设计还确实带来了更简洁的指令生成，见 Godbolt。不同编译器在不同编译选项下的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)
GCC 14 `-O3`	11.2	209.9	56.5	33.3	7.5	26.8	6.6	1.9
GCC 14 `-O3 -march=native`	10.2	159.6	54.8	29.9	7.1	26.8	3.3	1.8
GCC 14 `-O3 -ffast-math`	11.4	209.7	56.5	33.3	7.5	26.7	6.6	1.8
LLVM 22 `-O3`	11.3	194.5	56.5	33.3	8.6	26.5	6.7	1.9
LLVM 22 `-O3 -march=native`	9.6	149.4	58.2	29.9	2.8	26.5	3.4	2.0

3. aces¶

热点函数：

OpenColorIO_v2_2dev::Lut3DTetrahedralRenderer::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut3d/Lut3DOpCPU.cpp：50.74%，做的操作还挺复杂，每个元素首先进行一次乘法，然后进行一次 clamp，floor 和 ceil 后分别转化为 int，再根据 int 去进行对一个表进行间接访存，查表的结果再经过一系列的加权平均完成计算，向量化程度不高；
OpenColorIO_v2_2dev::MatrixRenderer::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/ops/matrix/MatrixOpCPU.cpp：11.55%，进行矩阵的运算，把输入的四维向量和一个 4x4 矩阵进行乘法，得到输出的四维向量，向量化程度较高；
__log2f_fma 来自 libm：10.02%，计算浮点 log2；
OpenColorIO_v2_2dev::CameraLin2LogRenderer::apply 来自 src/ASWF-OpenCOlorIO/src/OpenColorIO/ops/log/LogOpCPU.cpp：9.76%，判断输入的范围，如果小于一个阈值 m_linb，就用线性的乘加计算结果，否则就会调用上述 log2 函数，结合一些乘加以及 max 操作来进行计算，向量化程度低。

不同编译器和编译选项的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)
GCC 14 `-O3`	67.8	1258.9	299.3	86.3	100.5	260.6	28.0	146.6
GCC 14 `-O3 -march=native`	49.6	873.7	289.0	84.9	84.0	257.4	14.0	135.4
GCC 14 `-O3 -ffast-math`	67.8	1251.5	296.4	94.4	109.9	213.7	43.8	150.6
LLVM 22 `-O3`	61.7	1152.4	416.6	136.7	133.7	329.0	15.4	168.5
LLVM 22 `-O3 -march=native`	49.3	857.8	342.8	92.6	84.4	329.0	13.0	151.6

GCC 14 和 LLVM 22 在 -O3 下的性能差距主要来自于 floor 和 ceil 的处理：GCC 14 生成了一系列 SSE 指令来计算，由于没有 SSE4.1 的 roundps 指令，所以实现比较复杂，而 LLVM 22 转为采用 libm 的加速实现 __floorf_sse41，它的函数体就是一条 SSE4.1 的 roundps 指令加 return，虽然有函数调用的开销，不仅要 call/ret，还多了一些寄存器到栈的 Load 和 Store，但总体还是赚的。不过，如果处理器确实没有 SSE4.1 指令，那么 GCC 14 又该比 LLVM 22 更快了。这种取舍，在不开 -march=native 的时候确实无法实现，此时只能猜测，哪种情况发生的概率更高了，例如现在来看，有 SSE4.1 的 AMD64 处理器肯定是比没有 SSE4.1 的 AMD64 处理器要多。

开 -O3 -march=native 后，因为有了 vroundps 指令，原来的 ceil 和 floor 操作可以用向量指令代替，相比之前的向量化实现（GCC 14）或调用 libm 里的加速实现（LLVM 22），GCC 14 和 LLVM 22 都有不错的提升，来到了同一水平线上。同时 fma 也成功融合了不少浮点乘加计算。

4. heavy¶

热点函数：

__powf_fma 来自 libm：26.17%；
OpenColorIO_v2_2dev::Lut3DRenderer::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut3d/Lut3DOpCPU.cpp：25.69%，模式和上面的 OpenColorIO_v2_2dev::Lut3DTetrahedralRenderer::apply 比较类似，也有 clamp/floor/ceil 和查表等动作，就是最后的计算部分不太一样，也都是标量的 SSE 指令；
OpenColorIO_v2_2dev::Lut1DRenderer<BIT_DEPTH_F32, BIT_DEPTH_F32>::apply 来自 src/ASWF-OpenColorIO/src/OpenColorIO/ops/lut1d/Lut1DOpCPU.cpp：15.63%，模式和上述 OpenColorIO_v2_2dev::Lut3DRenderer::apply 类似，不过查表的部分更简单，因为只有一维，但也是全程标量；
OpenColorIO_v2_2dev::CDLRendererFwd<true>::apply：10.88%，里面调用了 pow，导致 __powf_fma 占用了很多时间，其余部分做了浮点乘法、加减法以及 Clamp 操作，还是全程标量；
OpenColorIO_v2_2dev::GammaMoncurveOpCPUFwd::apply：5.41%，同样调用了 pow，除了 pow 以外还有一些浮点运算以及比较。

不同编译器和编译选项的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)
GCC 14 `-O3`	54.6	1013.5	209.4	57.0	80.8	253.7	5.8	32.0
GCC 14 `-O3 -march=native`	40.9	764.7	204.0	54.8	70.8	260.2	3.3	31.8
GCC 14 `-O3 -ffast-math`	53.9	971.0	202.1	50.5	80.6	252.3	6.6	29.1
LLVM 22 `-O3`	49.0	861.5	250.4	77.3	102.7	215.6	29.9	28.8
LLVM 22 `-O3 -march=native`	40.9	726.8	206.9	55.4	67.3	255.6	25.7	28.5

LLVM 22 相比 GCC 14 的主要性能区别和 3. aces 一样，就是 ceil/floor 的处理。此外，就是和 731.astcenc_r 类似的情况，在遇到向量化的 min/max 操作的时候，LLVM 22 会正确向量化为对应的 maxps/minps 指令，而 GCC 14 生成的代码就会比较冗长。

小结¶

736.ocio_r 依然是一个比较适合向量化的应用，虽然它不像 731.astcenc_r 那样直接用 vfloat4 格式，但因为它是图像处理，每次循环处理一个像素，然后每个像素有四个通道，在很多情况下，这四个通道的计算过程是一样的，因此也非常适合向量化。而 LLVM 22 在 -O3 下做出了比 GCC 14 更好的指令生成，从 floor/ceil 到 libm 函数的映射，以及更好的向量化实现。当然，开 -O3 -march=native 后，GCC 14 和 LLVM 22 的性能差距非常小，说明在两方都开启足够的指令集扩展以后，基本会收敛到差不多的代码实现上，这也反过来说明，GCC 14 的 SSE 代码生成上有一些欠缺，可能的情况是，并非 GCC 14 不能向量化（因为开 -O3 -march=native 后就学会了），而是尝试向量化后，不知道怎么用 SSE 表达向量化后的代码，于是退回到了标量。

737.gmsh_r¶

737.gmsh_r 是 3D 的 CAD 软件，包括七个负载：

# 1. choi gmsh_r -option gmsh.opts -nt 0 choi.geo # 2. mediterranean gmsh_r -option gmsh.opts -nt 0 mediterranean.geo # 3. projection gmsh_r -option gmsh.opts -nt 0 projection.geo # 4. gasdis gmsh_r -option gmsh.opts -nt 0 gasdis.geo # 5. Torus gmsh_r -option gmsh.opts -nt 0 Torus.geo # 6. spec gmsh_r -option gmsh.opts -nt 0 spec.geo -clscale 0.175 -algo del2d -algo hxt # 7. p19 gmsh_r -option gmsh.opts -nt 0 p19.geo

各负载运行时间为 17.1s、11.8s、11.2s、16.9s、9.2s、13.4s、12.8s，总时间 92.2s，reftime 是 459s，对应 4.98 分。-O3 -ffast-math 和 -O3 -march=native 收益都很小，LLVM 22 反而比 GCC 14 更慢，因此这里就不做具体比较了。

用 -O3 -march=native 编译的时候，发现如果 CC 只传了 gcc，而没有传 -std=c18，就会在 4. gasdis 这一个负载里死循环，一直报错：Info : Symbolic perturbation failed (2 superposed vertices ?)。经过对比，两者的区别在于是否进行乘加融合：-O3 -std=c18 -march=native 时，不会进行融合，而 -O3 -march=native 或 -O3 -std=gnu18 -march=native 时会进行融合，见 Godbolt。在其他程序里，融合对性能更优，但这里很不幸，融合了就会导致死循环。这和 -fp-contract 有关：

-ffp-contract=style   -ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is implemented for C and C++, where it enables contraction within one expression, but not across different statements.   The default is -ffp-contract=off for C in a standards compliant mode (-std=c11 or similar), -ffp-contract=fast otherwise.

可见它只对 C 语言有效，对 C++ 无效，实际上就是只对 737.gmsh_r 有影响；虽然 709.cactus_r 也有 C 代码，但它的主要计算都在 C++ 语言的部分。

接下来针对各负载进行热点分析。

1. choi¶

热点函数：

netgen::ADTree6::GetIntersecting 来自 src/gmsh/contrib/Netgen/libsrc/gprim/adtree.cpp：18.40%，实现了一个 6 维的 KD-Tree 的搜索算法，主要瓶颈在于中间的数据依赖的分支 if (node->pi != -1)，预测错误率较高；
__ieee754_atan2_fma 来自 libm：6.64%；
reparamMeshVertexOnFace 来自 src/gmsh/src/geo/MVertex.cpp：6.03%，根据顶点的维度进入不同的 if-else 分支进行处理，错误预测也比较多。

虽然用到了浮点，但计算模式并不适合向量化。毕竟是 KD-Tree 的搜索，MPKI 高是正常现象。执行了 204.7B 条指令，错误预测 744.3M 次，MPKI 等于 744.3M/204.7B*1000=3.64，是 SPEC FP 2026 Rate 中第二高的。第一高 731.astcenc_r 如上所述，其实是 GCC 的实现不够好，完全可以把 MPKI 优化到 LLVM 22 的 1.3 左右，那样的话 737.gmsh_r 就是第一了。

2. mediterranean¶

热点函数：

meshGEdgeProcessing 来自 src/gmsh/src/mesh/meshGEdge.cpp：36.55%，主要瓶颈在循环中的 gauss seidel 迭代，标量除法和比较耗费了比较多的时间；
KDTreeSingleIndexAdaptor::searchLevel 来自 src/gmsh/src/numeric/nanoflann.hpp：33.50%，又一个经典的 KD-Tree 的搜索算法，根据输入的值递归到左子树或右子树；
InterpolateCurve 来自 src/gmsh/src/geo/GeoInterpolation.cpp：6.53%，递归进行一些插值的计算。

虽然用到了浮点，但计算模式依然不适合向量化，因为中间的计算结果还被用于 if 分支，分支内也有若干浮点计算。

3. projection¶

热点函数：

laplaceSmoothing 来自 src/gmsh/src/mesh/meshGFaceOptimize.cpp：11.73%，主要瓶颈是 std::set 的操作，，而 std::set 是用 std::map 实现的，因此会调用下面的 std::map 的代码；
std::map::_M_get_insert_unique_pos 来自 libstdc++：7.49%，std::map 的插入算法实现；
__ieee754_atan2_fma 来自 libm：7.21%；
reparamMeshVertexOnFace：6.66%，描述见上；
std::map::_M_get_insert_unique 来自 libstdc++：6.09%，std::map 的插入实现；
SetRotationMatrix 来自 src/gmsh/src/geo/Geo.cpp：5.01%，代码是多层循环，适合向量化，编译器也确实向量化了，不过时间占比并不高。

可见，该负载主要还是 std::map 相关的操作为主要瓶颈。

4. gasdis¶

热点函数：

MakeHybridHexTetMeshConformalThroughTriHedron 来自 src/gmsh/src/mesh/meshCombine3D.cpp：30.18%，主要瓶颈是在循环里对 std::map 进行搜索；
parallelDelaunay3D 来自 src/gmsh/contrib/hxt/tetMesh/src/hxt_tetDelaunay.c：9.05%，实现了 Delaunay 三角剖分算法；
hxtRefineTetrahedra 来自 src/gmsh/contrib/hxt/tetMesh/src/hxt_tetRefine.c：5.18%，主要是指循环中做一些浮点计算，包括加减法，乘除法和 sqrt。

瓶颈主要还是在 std::map。

5. Torus、6.spec 和 7.p19¶

最后三个负载，其热点函数都与 4.gadis 相同，不再赘述。

小结¶

各负载的情况：

负载	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)	MPKI
1. choi	17.0	204.7	59.3	25.6	39.4	22.1	0.3	744.3	3.64
2. mediterranean	11.7	190.7	57.4	23.2	24.0	28.5	2.4	71.0	0.37
3. projection	11.1	109.0	29.1	14.4	20.3	13.3	2.2	183.0	1.68
4. gasdis	16.9	157.8	46.3	17.8	27.6	19.6	0.2	689.9	4.37
5. Torus	9.2	77.3	21.9	8.2	13.4	9.4	0.5	380.4	4.92
6. spec	13.3	101.4	30.2	10.8	18.1	10.9	0.2	546.1	5.39
7. p10	12.7	96.3	28.8	10.2	17.2	10.4	0.1	529.3	5.50

可见整体的 MPKI 还是偏高的，并且很大程度上归功于 KD-Tree 的查询以及 std::map 的查询或插入，只不过这些树的 key 都是单精度浮点数。并且根据上面的分析，确实相关的代码不适合向量化，浮点乘加融合还被禁用了，否则就可能不收敛。

748.flightdm_r¶

flightdm 是一个飞行动力学模拟器，该基准测试包括如下八项负载：

# 1. weather JSBSim --nohighlight scripts/weather-balloon2.xml # 2. B747 JSBSim --nohighlight scripts/B747_script1.xml # 3. x153 JSBSim --nohighlight scripts/x153.xml # 4. c3104 JSBSim --nohighlight scripts/c3104.xml # 5. ah1s JSBSim --nohighlight scripts/ah1s_flight_test.xml # 6. orbit_torque JSBSim --nohighlight scripts/ball_orbit_g_torque.xml # 7. orbit_torque2 JSBSim --nohighlight scripts/ball_orbit_g_torque2.xml # 8. orbit JSBSim --nohighlight scripts/ball_orbit.xml

各负载的运行时间分别为 5.9s、14.7s、10.9s、11.3s、24.8s、8.0s、9.8s 和 8.4s，一共 93.9s，reftime 是 716s，对应 7.63 分。开 -O3 -march=native 仅对性能有 2% 的提升，-O3 -ljemalloc 反而能提升 4%，-O3 -flto 能提升 11%。LLVM 22 性能不如 GCC 14，这里就不赘述了。下面对各负载进行分析。

1. weather¶

热点函数：

__sincos_fma 来自 libm：6.75%；
__ieee754_atan2_fma 来自 libm：6.41%；
__strncmp_avx2 来自 libc：5.04%；
parse_path 来自 src/JSB-FlightSim/src/simgear/props/props.cxx：4.43%，路径字符串的解析，拆分成多个 component；
__ieee754_pow_fma 来自 libm：4.05%。

热点也挺神奇的，都是一些 libm/libc 的函数，flightdm 自己的代码耗时最多的居然是个路径解析。各种优化选项没啥效果，也不足为奇了。

2. B747¶

热点函数：

SGPropertyNode::getDoubleValue 来自 src/JSB-FlightSim/src/simgear/props/props.cxx：5.65%，看起来是对配置文件的解析，然后从解析结果里提取浮点数；
__ieee754_atan2_fma 来自 libm：5.42%；
__sincos_fma 来自 libm：5.25%；

依然没啥好分析的。

3. x153 和 4. c3104¶

热点函数和 2. B747 相同，不再赘述。

5. ah1s¶

热点函数：

SGPropertyNode::getDoubleValue 来自 src/JSB-FlightSim/src/simgear/props/props.cxx：8.45%，描述见上；
JSBSim::aFunc::getValue 来自 src/JSB-FlightSim/src/math/FGFunction.cpp：7.20%，是一个带有 memo 能力的类似 std::function 的容器；
__sincos_fma 来自 libm：6.04%；
__ieee754_atan2_fma 来自 libm：5.35%；
JSBSim::FGPropertyValue::getValue 来自 src/JSB-FlightSim/src/math/FGPropertyValue.cpp：5.11%，调用上面的 getDoubleValue 函数；

给人的感觉就是，不是在调用 libm 计算一些超越函数，就是在做配置文件内容的提取。

6. orbit_torque¶

热点函数：

__ieee754_atan2_fma 来自 libm：7.52%；
__sincos_fma 来自 libm：6.82%；
__strncmp_avx2 来自 libc：6.57%；
parse_path 来自 src/JSB-FlightSim/src/simgear/props/props.cxx：6.12%，路径字符串的解析，拆分成多个 component；
SGPropertyNode::getChild 来自 src/JSB-FlightSim/src/simgear/props/props.cxx：4.05%，遍历结点的子结点，通过字符串比较，找到匹配的子结点。

7. orbit_torque2 和 8. orbit¶

热点函数与 6. orbit_torque 相同，不再赘述。

小结¶

748.flightdm_r 是个没意思的基准测试，时间很多花在了 libm 和 libc 的函数上，自己的代码就是在配置文件里来回遍历，我愿称它为 libm 基准测试。除此之外，表现得更像一个 SPEC INT 2026 Rate 的负载：字符串操作，内存分配，很多小函数和 lambda，适合 -O3 -flto 优化。最后看一下 -O3 下各负载的情况：

负载	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	错误预测 (M)	MPKI
1. weather	5.9	106.1	30.8	15.4	19.5	12.9	0.6	11.6	0.11
2. B747	14.8	260.1	80.0	38.7	49.4	28.4	1.7	25.6	0.10
3. x153	10.8	193.3	59.1	28.7	37.3	20.0	1.0	20.9	0.11
4. c3104	11.4	194.6	58.9	29.1	35.7	23.9	1.3	18.2	0.09
5. ah1s	24.7	407.3	130.0	61.3	77.9	46.4	1.6	49.3	0.12
6. orbit_torque	7.9	152.8	41.9	22.7	28.3	16.3	1.1	24.2	0.16
7. orbit_torque2	9.9	191.4	52.5	28.4	35.3	21.0	1.2	17.1	0.09
8. orbit	8.4	161.6	44.3	23.9	30.0	17.2	1.0	16.3	0.10

乏善可陈。

749.fotonik3d_r¶

终于出现了一个 SPEC FP 2017 Rate 的老面孔，此前是 549.fotonik3d_r。fotonik3d 做的是 3D 空间里的麦克斯韦方程求解，又一个物理背景的基准测试，一般这种三维空间里的偏微分方程求解，必定会有 Stencil，下面看看这个猜测对不对。该基准测试只有一个负载：

fotonik3d_r

reftime 是 1156s，在不同编译选项下，749.fotonik3d_r 的运行情况：

编译器 + 选项	时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)	指令数 (B)	Load 指令数 (B)	Store 指令数 (B)	分支指令数 (B)	浮点标量指令数 (B)	浮点向量指令数 (B)
GCC 14 `-O3`	131.1	8.82	0	1408.5	375.1	120.7	30.9	5.4	527.2
GCC 14 `-O3 -march=native`	114.9	10.1	14	670.1	274.1	82.4	27.1	5.5	249.4
GCC 14 `-O3 -ffast-math`	116.7	9.91	12	1117.6	378.4	120.8	30.7	4.8	396.2
GCC 14 `-O3 -ffast-math -march=native`	108.5	10.65	21	599.5	276.3	82.3	26.9	4.8	204.8

LLVM 22 性能和 GCC 14 差不多，这里就不单列了。可见 -O3 -march=native 和 -O3 -ffast-math 都有不错的性能提升，下面进行热点分析：

power_dft 来自 src/power.F90：30.92%，进行的是离散傅里叶变化 DFT，主要瓶颈是在循环中进行双精度浮点乘加运算，GCC 14 把它编译成 SSE 的向量指令；
UPML_updateE_simple 来自 src/UPML.F90：24.73%，主要时间在进行三维的 Stencil 计算，果然物理模拟都离不开 Stencil 计算，GCC 14 编译出 SSE 向量指令进行计算；
UPML_updateH 来自 src/UPML.F90：23.26%，依然是 3D 的 Stencil 计算，采用 SSE 向量指令；
mat_updateE 来自 src/material.F90：11.04%，同样是 Stencil 计算，采用 SSE 向量指令；
updateH 来自 src/update.F90：9.78%，也是 Stencil 计算，采用 SSE 向量指令。

由此可见，除了 power_dft 以外，大部分时间都在进行 Stencil 计算，这次 Stencil 计算的模式更加纯粹，因为 GCC 能够比较好地用 SSE 进行向量化。根据前面的经验，这类程序在 -O3 -march=native、-O3 -ffast-math 以及 -O3 -ffast-math -march=native 下都是有很大的提升的：

开启 -march=native 后，可以用更宽的 AVX2 向量，并行度更高，同时还能使用浮点乘加融合指令，例如 vfmaddsub231pd。

开启 -O3 -ffast-math 以后，power_dft 中的核心计算，实际上计算的是，复数乘以实数再加复数，如下面的 Fortran 代码所示：

subroutine update(Efreq1, Efreq2, expfuncE, Efield1, Efield2, n)  implicit none  integer, intent(in) :: n  complex(8), intent(inout) :: Efreq1(n), Efreq2(n)  complex(8), intent(in) :: expfuncE(n)  real(8), intent(in) :: Efield1, Efield2  integer :: i   do i = 1, n  Efreq1(i) = Efreq1(i) + expfuncE(i) * Efield1  Efreq2(i) = Efreq2(i) + expfuncE(i) * Efield2  end do end subroutine update

在 -O3 时，GCC 14 会忠实地实现复数乘法，然而，实际上这里的 Efield1 和 Efield2 都是实数，转换过去的复数的虚部只能是零，因此通过 -O3 -ffast-math 的化简，直接把实部乘到 expfuncE 的实部和虚部即可，这样就可以简化指令。如果开 -O3 -ffast-math -march=native，将可以结合两个优化，直接用 AVX2 乘加融合指令 vfmadd213pd 完成这次运算，不需要像 -O3 -march=native 时用 vfmaddsub231pd 同时做加法和减法（原来的减，来自于复数乘法的定义，在这里减去的总是零，因为 Efield1/Efield2 的虚部是零），详见 Godbolt。

小结一下，749.fotonik3d_r 是经典的浮点应用，大量 Stencil 加浮点向量运算，并行度高，适合向量化，还能享受 -ffast-math 带来的浮点计算顺序优化。

765.roms_r¶

又一个从 SPEC FP 2017 Rate 复活的基准测试，上一世是 554.roms_r，实现的是海洋模拟，不出意外依然是 Stencil，它只有一个负载：

roms_r < roms_benchmark2.in.x

reftime 是 1575s，不同编译器和编译选项下的运行情况：

编译器 + 选项	时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)	指令数 (B)	Load 指令数 (B)	Store 指令数 (B)	分支指令数 (B)	浮点标量指令数 (B)	浮点向量指令数 (B)
GCC 14 `-O3`	169.8	9.28	0	2620.6	874.8	204.7	192.1	193.3	709.2
GCC 14 `-O3 -march=native`	149.5	10.5	14	1317.9	555.3	125.0	126.6	164.9	365.9
GCC 14 `-O3 -ffast-math`	162.8	9.67	4	2518.6	854.5	204.0	178.5	134.0	711.7
LLVM 22 `-O3`	165.6	9.51	3	2434.3	834.9	190.3	164.1	231.8	687.0
LLVM 22 `-O3 -march=native`	152.1	10.4	12	1423.4	551.4	131.2	140.1	259.8	350.0

从以上数据就可以看出，浮点计算很多，高度可向量化，因此 -O3 -march=native 的性能提升是很正常的。

热点函数：

step2d_tile，来自 src/step2d_LF_AM3.h：20.37%，主要瓶颈是 2D 的 Stencil 计算，向量化程度高；
pre_step3d 来自 src/pre_step3d.F90：10.43%，主要瓶颈是在循环当中的浮点计算，向量化程度高；
lmd_skpp 来自 src/lmd_skpp.F90：8.91%，主要瓶颈是循环中的复杂浮点计算，浮点标量计算为主；
step3d_t_tile 来自 src/step3d_t.F90：7.04%，主要瓶颈是 3D 的 Stencil 计算，向量化程度高；
rhs3d 来自 src/rhs3d.F90：6.04%，主要瓶颈是 2D 的 Stencil 计算，向量化程度高；
t3dmix2 来自 src/t3dmix2_geo.h：5.86%，主要瓶颈是 3D Stencil 计算，向量化程度高；
step3d_uv_tile 来自 src/step3d_uv.F90：5.85%，主要瓶颈是 3D Stencil 计算，向量化程度高；
_ZGVbN2v_exp_sse4 来自 libmvec：4.66%，向量化版本的 exp。

还是典型的 Stencil 计算，向量化程度高。开 -O3 -march=native 后，向量宽度增加，加上 FMA 的引入，自然带来了不错的性能提升。

766.femflow_r¶

femflow 是流体动力学求解器，求解 Navier-Stokes 方程。该基准测试只包括一个负载：

femflow_r refrate.prm

reftime 是 1467s，不同编译器和编译选项下的运行情况：

编译器 + 选项	时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)	指令数 (B)	Load 指令数 (B)	Store 指令数 (B)	分支指令数 (B)	浮点标量指令数 (B)	浮点向量指令数 (B)
GCC 14 `-O3`	188.7	7.77	0	3862.4	1358.5	797.6	117.5	562.2	676.0
GCC 14 `-O3 -march=native`	95.1	15.4	98	1736.9	619.3	356.0	65.2	286.8	445.4
GCC 16 `-O3`	153.6	9.55	23	3178.6	1109.3	673.3	127.2	56.3	930.9
GCC 16 `-O3 -march=native`	83.5	17.57	126	1457.0	501.1	281.4	61.1	47.2	545.7
LLVM 22 `-O3`	124.7	11.8	51	2703.0	857.3	475.5	60.6	40.8	930.3
LLVM 22 `-O3 -march=native`	88.7	16.5	113	1392.9	495.7	269.4	42.9	41.8	471.1

可见，LLVM 22 相比 GCC 14 有显著的性能提升，同时 -O3 -march=native 带来了更加显著的性能提升，是整个 SPEC FP 2026 Rate 当中，-O3 -march=native 带来提升第二高的基准测试，第一高是后面会看到的 772.marian_r。GCC 16 相比 GCC 14 也有不错的性能提升，开 -O3 -march=native 后反超 LLVM 22。

热点函数还不少，很多函数都是个位数百分比的占用，大多是一些算子：

Laplace::LaplaceOperator::local_apply_quadratic_geo 来自 src/laplace_operator.h：5.49%，内部是大量的浮点向量计算，并行度高；
operator *(const dealii::VectorizedArray &, const dealii::VectorizedArray &) 来自 src/dealii/include/deal.ll/base/vectorization.h：5.36%，两个向量的逐元素乘法。

其他还有一些 dealii:Tensor 的计算，包括来自 src/dealii/include/deal.ll/matrix_free/tensor_product_kernels.h 的 dealii::internal::even_odd_apply，是 Tensor 双精度浮点乘法的实现，这里 even-odd 的意思是利用数据的对称性，把数据拆成 even 和 odd 两部分进行计算，可以节省计算次数，同时适合向量化。对于这类负载，-O3 -march=native 开启后，更快的向量长度带来了更好的浮点运算性能，同时还有 FMA 指令的加持。

LLVM 22 相比 GCC 14 的优势，主要来自于把更多代码进行了向量化，对比 GCC 14 和 LLVM 22 执行的指令数，可以看到 LLVM 22 执行的浮点标量指令数比 GCC 14 要少，而浮点向量指令又要多。GCC 16 也是类似的情况，向量化程度逼近 LLVM 22。

767.nest_r¶

nest 是个脉冲神经网络的模拟器，忽然出现一个熟悉的面孔，也挺难得。该基准测试分为三个负载：

# 1. cuba nest_r cuba_stdp.sli # 2. structural nest_r structural_plasticity_benchmark # 3. Artificial nest_r ArtificialSynchrony

开 -O3 -march=native 只有 3% 的性能提升，LLVM 22 比 GCC 14 更慢，这里就不进行编译器和编译选项的对比了。三个负载在 GCC 14 -O3 下的对比：

负载	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)
1. cuba	14.1	176.3	54.5	21.6	22.4	29.2
2. structural	24.6	413.3	136.3	42.8	52.5	93.2
3. Artificial	48.6	1125.4	392.6	150.5	160.5	163.6

总时间 87.4s，reftime 是 793s，对应 9.07 分。下面进行负载的具体分析。

1. cuba¶

热点函数：

nest::iaf_psc_exp::handle 来自 src/nest-simulator/models/iaf_psc_exp.cpp：25.75%，处理该神经元接收到的脉冲，更新内部状态，主要瓶颈是间接访存，把脉冲的强度写入到对应的输入缓存区；
__ieee754_pow_fma 来自 libm：11.96%，被后面的 nest::Connector::send 函数调用；
spec::poisson_distribution::operator() 来自 src/specrand-distributions/spec_random_distributions.cpp：9.87%，生成随机数，以生成输入的脉冲；
nest::Connector::send 来自 src/nest-simulator/nestkernel/connector_base.h：8.29%，负责脉冲在突触上的传播和 STDP，主要瓶颈是间接访存，以及内联了一些脉冲上的权重计算，还会调用 pow 和 exp；
nest::iaf_psc_exp::update 来自 src/nest-simulator/models/iaf_psc_exp.cpp：6.91%，在每个时间步对神经元的状态进行更新，主要是标量的浮点运算。

算是一个比较经典的带 STDP 的 SNN 模拟，主要瓶颈就是脉冲传播和 STDP 的突触权重更新，向量化程度很低，还有间接访存。

2. structural¶

热点函数：

spec::poisson_distribution::operator() 来自 src/specrand-distributions/spec_random_distributions.cpp：24.26%，描述见上；
nest::iaf_psc_alpha::update 来自 src/nest-simulator/models/iaf_psc_alpha.cpp：13.71%，做的事情和上面 nest::iaf_psc_exp::update 类似，就是换了个神经元模型；
__ieee754_pow_fma 来自 libm：13.37%，描述见上；
nest::GrowthCurveGaussian::update 来自 src/nest-simulator/nestkernel/growth_curve.cpp：6.60%，主要在用数值计算求解微分方程，频繁调用 exp 和 pow；
nest::iaf_psc_alpha::handle 来自 src/nest-simulator/models/iaf_psc_alpha.cpp：25.75%，功能和上面 nest::iaf_psc_exp::handle 类似；
nest::Connector::send 来自 src/nest-simulator/nestkernel/connector_base.h：6.60%，描述见上，这次没有 STDP，权重是静态的；
exp 来自 libm：5.39%。

和 1. cuba 相比，换了一个神经元模型，去掉了 STDP，结果主要的瓶颈跑到了泊松分布的随机生成，其余部分还是比较典型的 SNN 模拟。

3. Artificial¶

热点函数：

nest::iaf_psc_alpha_ps::update 来自 src/nest-simulator/models/iaf_psc_alpha_ps.cpp：13.26%，神经元的状态更新函数；
nest::iaf_psc_alpha::update 来自 src/iaf_psc_alpha.cpp：12.37%，描述见上；
nest::Connector::send 来自 src/nest-simulator/nestkernel/connector_base.h：7.19%，描述见上，这次依然没有 STDP，权重是静态的；
nest::SimulationManager::update_ 来自 src/nest-simulator/nestkernel/simulation_manager.cpp：5.66%，核心的 SNN 模拟循环，调用上面的各种函数。
__ieee754_pow_fma 来自 libm：5.17%，描述见上。

小结¶

研究 SNN 的应该很熟悉，nest 是个很灵活的 SNN 模拟器，但单线程性能也确实不咋地，主要精力花在了多核/多线程上。不出所料，nest 的神经元更新部分没有向量化，所以挺慢的，而脉冲传播和 STDP 部分本来就很难优化。总之，这是个难以向量化的浮点应用，从上面的性能计数器来看，一条向量浮点指令都没有。

772.marian_r¶

marian_r 是一个基于神经网络的翻译器，又是一个神经网络推理，意味着又是一个 -O3 -march=native 非常有优势的测例，如果像 706.stockfish_r 那样有直接可以用的硬件加速指令，性能将会比 -O3 快得多。该基准测试包括两个负载：

# 1. TildeMODEL marian-decoder --cpu-threads 1 -m model.alphas.npz -v vocab.spm vocab.spm --beam-size 1 --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src -w 512 --skip-cost --gemm-type intgemm8 --intgemm-options precomputed-alpha standard-only --quiet --quiet-translation -i TildeMODEL-spec.en --log TildeMODEL-spec.log --log-level off -o TildeMODEL-spec.out # 2. EuroPat marian-decoder --cpu-threads 1 -m model.alphas.npz -v vocab.spm vocab.spm --beam-size 1 --mini-batch 32 --maxi-batch 100 --maxi-batch-sort src -w 512 --skip-cost --gemm-type intgemm8 --intgemm-options precomputed-alpha standard-only --quiet --quiet-translation -i EuroPat-spec.en --log EuroPat-spec.log --log-level off -o EuroPat-spec.out

reftime 是 1579s，下面是不同编译器版本和编译选项的对比：

编译器 + 选项	时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)	1. TildeMODEL 时间 (s)	2. EuroPat 时间 (s)
GCC 14 `-O3`	235.2	6.71	0	88.8	146.4
GCC 14 `-O3 -march=native`	78.4	20.14	200	28.2	50.3
GCC 15 `-O3`	150.1	10.52	57	56.0	94.8
GCC 15 `-O3 -march=native`	77.5	20.37	203	27.8	49.7

可见 -O3 -march=native 带来的提升巨大，高达 200%，在 Apple M1 上有 47% 的提升，在 Apple M2 上更是提升了 92%，这种提升，之前只在 706.stockfish_r 上见到过。并且 GCC 15 也比 GCC 14 在 -O3 时有明显性能提升。下面分负载来讨论。

1. TildeMODEL¶

热点函数：

marian::cpu::integer::affineOrDotTyped 来自 src/marian/tensors/cpu/intgemm_interface.h：82.28%，主要时间在 tiled_gemm 函数里，做的是整数矩阵乘法，uint8_t 类型的 A 矩阵乘以 int8_t 类型的 B 矩阵，累加到 int32_t 类型，最后转换到 float 再加 float 的 C 矩阵；
marian::cpu::ProdBatched 来自 src/marian/tensors/cpu/prod.cpp：10.30%，核心部分是 sgemm，这次确实是浮点的矩阵运算了，虽然被编译成了 SSE 的标量的浮点计算而不是向量，但考虑到时间占比，也无伤大雅了。

可以看到，主要的热点部分，和 706.stockfish_r 的 nnue 的计算模式完全一样，因此开 -O3 -march=native 后，一样可以用 AVX-VNNI 的 vpdpbusd 指令优化，见 Godbolt。同理 GCC 15 因为更优的无符号扩展实现方式，性能比 GCC 14 要更好。具体的讨论，可以见之前 INT Rate 篇中 706.stockfish_r 的部分。

不同编译器和编译选项下的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	128 位整数向量 (B)	256 位整数向量 (B)
GCC 14 `-O3`	88.2	2038.9	217.8	57.8	53.2	58.7	2.1	514.6	0.0
GCC 14 `-O3 -march=native`	27.6	423.0	131.5	25.1	47.4	59.8	1.1	12.8	47.4
GCC 15 `-O3`	55.6	1353.5	173.9	22.1	53.2	58.7	2.1	184.7	0.0
GCC 15 `-O3 -march=native`	27.3	415.1	128.9	23.5	47.5	59.8	1.1	12.8	47.4

其中 128 位整数向量来自 int_vec_retired.128bit 计数器，256 位整数向量来自 int_vec_retired.256bit 计数器。

2. EuroPat¶

热点函数：

marian::cpu::integer::affineOrDotTyped：78.96%，描述见上；
marian::cpu::ProdBatched：14.25%，描述见上。

热点函数和 1. TileMODEL 完全相同，其余的分析对 2. EuroPat 也是成立的，这里直接给出性能计数器的对比：

不同编译器和编译选项下的对比：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	浮点标量 (B)	浮点向量 (B)	128 位整数向量 (B)	256 位整数向量 (B)
GCC 14 `-O3`	145.6	3352.7	370.4	89.7	98.8	123.8	3.6	815.0	0.0
GCC 14 `-O3 -march=native`	49.7	777.2	228.7	36.6	88.3	123.9	1.7	19.9	72.6
GCC 15 `-O3`	94.2	2268.5	301.7	33.1	98.8	123.8	3.6	293.6	0.0
GCC 15 `-O3 -march=native`	49.0	765.3	225.2	34.3	88.3	123.9	1.7	19.9	72.6

小结¶

772.marian_r 鉴定为 706.stockfish_r 的 NNUE 翻版，热点就是 int8_t 乘 uint8_t 累加到 int32_t 的矩阵乘运算，整数向量指令比浮点指令还多，建议开除 SPEC FP 2026 Rate 籍。

782.lbm_r¶

lbm 是 lattice boltzmann method 的缩写，又是一个流体动力学的应用，依然是 Stencil。该基准测试只有一个负载：

lbm_r 900 reference.dat 0 0 200_200_130_ldc.of

reftime 是 573s，不同编译选项下的性能对比：

编译器 + 选项	时间 (s)	分数	相比 GCC 14 `-O3` 性能提升 (%)	指令数 (B)	Load 指令数 (B)	Store 指令数 (B)	分支指令数 (B)	浮点标量指令数 (B)
GCC 14 `-O3`	105.8	5.42	0	2232.2	473.3	242.4	14.5	1108.2
GCC 14 `-O3 -ffast-math`	95.8	5.98	10	1892.4	419.2	192.8	14.5	1009.5
GCC 14 `-O3 -march=native`	131.0	4.37	-19	1669.6	550.3	309.8	14.5	1228.8
GCC 15 `-O3`	105.2	5.45	0.6	2218.9	468.9	242.4	14.5	1108.2
GCC 15 `-O3 -march=native`	111.0	5.16	-5	1777.3	509.8	282.9	14.5	1108.2
GCC 16 `-O3`	105.4	5.44	0.4	2218.9	468.9	242.4	14.5	1108.2
GCC 16 `-O3 -march=native`	110.6	5.18	-4	1777.3	509.8	282.9	14.5	1108.2

热点函数只有一个，就是 LBM_performStreamCollideTRT 函数来自 src/lbm.c，占了 99.35% 的时间。其结构是从当前轮次 Grid 读取、大量浮点计算、写入下一轮次 Grid，中间还有分支判断，访存为跨步（strided）模式，难以向量化，生成的都是 SSE 标量指令。对于这种标量计算密集的情况，-O3 -ffast-math 通常能通过调整计算顺序、复用中间结果来节省一些计算。

开启 -O3 -march=native 后性能反而下降，GCC 14 倒退最多（-19%），GCC 15/16 稍好但也不如 -O3。分析汇编，推测是因为对栈的访存指令变多，抵消了 FMA 乘加融合减少指令数的优势，详见 Godbolt。注意 FMA 指令在上述表格的浮点标量指令数一栏会计数两次，在总指令数一栏只会计数一次。

讨论¶

编译器选项对比¶

综合来看，编译选项对 SPEC FP 2026 Rate 的性能影响同样不小：

-march=native 对很多基准测试有不错的性能提升。毕竟 AVX2 相比 SSE 不仅在宽度上拓宽，还增加了很多好用的指令，可以减少指令数，还有 AVX-VNNI 这种对 772.marian_r 特攻的；
-ffast-math 也有不错的提升，尤其 SPEC FP 2026 Rate 有不少浮点运算，完全按照源码的编写方式去计算，往往不如调整运算顺序后来得快。但也要注意，-ffast-math 可能会导致计算结果不符合 IEEE 754 标准。
-flto 和 -ljemalloc 对 SPEC FP 2026 Rate 的多数基准测试效果不大，但对 748.flightdm_r 有些许提升。

还有一些常用的编译参数，比如 -static、-fomit-frame-pointer 等等，目前没有做太多测试，以后说不定会加上。

分支预测¶

SPEC FP 2026 Rate 中 MPKI 特别高的只有 731.astcenc_r 和 737.gmsh_r，其他最高也就是 767.nest_r 的 0.87。731.astcenc_r 如此的高，完全是 GCC 14 编译的锅，换成 LLVM 22 立马就正常了，希望后续 GCC 能修一修。

总结¶

本文深入分析了 SPEC CPU 2026 中 FP Rate 的负载，供编译器和处理器的设计者参考。从编译器的角度来说，可以集 GCC 和 LLVM 之长，进一步提升性能；从处理器的角度来说，针对程序的瓶颈进行优化，也能进一步提高分数。

SPEC CPU 2026 Workload Analysis (INT Rate)

Fri, 22 May 2026 00:00:00 +0000

SPEC CPU 2026 Workload Analysis (INT Rate)¶

中文版本

Background¶

I've been running some benchmarks with SPEC CPU 2026 recently, and plan to do in-depth workload analysis combined with the test results. This article focuses on SPEC INT 2026 Rate workload characteristics. For SPEC FP 2026 Rate analysis, see the FP Rate article.

Test environment: CPU is Intel i9-14900K P-Core @ 5.7 GHz, Linux distribution is Debian Trixie, compiler is GCC 14.2.0, default compilation flags are -O3. This CPU can actually boost up to 6.0 GHz, but occasionally fails to boost under single-core workloads for unknown reasons (degradation protection?), specifically manifesting as the CPU core being forced down to 4.7 GHz after running for a while. So I opted for the more reliably achievable 5.7 GHz. Only one physical P-core can stably run at 6.0 GHz; other P-cores can all reach 5.7 GHz, and switching to another core when throttling occurs is sufficient. Performance at 6.0 GHz can be referenced from previous test results: INT and FP, basically, from 5.7 GHz to 6.0 GHz, performance scales linearly with frequency. This article may give multiple different runtimes for the same workload, which could be due to performance variance across multiple runs or because some numbers include perf record overhead, but the errors are small enough for reliable comparison. The scripts used in this article are open-sourced at jiegec/spec2026.

SPEC INT 2026 Rate Analysis¶

706.stockfish_r¶

Stockfish is a well-known chess engine. This benchmark includes three workloads:

# 1. 1to6_classical stockfish bench 1600 1 26 spec_ref_pos_1to6.fen depth classical # 2. 1to6_nnue stockfish bench 1600 1 26 spec_ref_pos_1to6.fen depth nnue # 3. 7to11_nnue stockfish bench 1600 1 26 spec_ref_pos_7to11.fen depth nnue

Measured data shows the three workloads take 47s, 77s, and 72s respectively, totaling 196s. The reftime is 1260s, corresponding to 6.4 points. With -march=native enabled, 1to6_classical time decreases by 10% to 43s, while 1to6_nnue and 7to11_nnue significantly decrease to 32s and 31s, total time 105s, corresponding to 12 points, a significant score improvement. Below is a per-workload performance analysis.

1. 1to6_classical¶

Using perf to observe performance bottlenecks, the major hotspot functions for 1to6_classical and their time shares are listed below (subsequent benchmarks use the same representation):

Stockfish::Eval::evaluate(const Position& pos) from src/evaluate.cpp: 19.16%, inlines the Evaluation<NO_TRACE>(pos).value() call, mainly evaluating board positions with scattered memory accesses and computations, no particularly concentrated hotspot instructions;
Stockfish::TranspositionTable::probe(const Key key, bool& found) from src/tt.cpp: 17.91%, the main bottleneck is random memory access in first_entry(key) which contains &table[mul_hi64(key, clusterCount)].entry[0], where mul_hi64 computes the upper 64 bits of a 64-bit integer multiplication, so the memory address is computed from the argument; for mul_hi64, GCC 14 faithfully splits the 64-bit values into high and low 32-bit halves, while LLVM 22 correctly recognizes the code's intent and uses AMD64's mul instruction directly. This was implemented in PR #168396, with mul_hi64 corresponding to "Ladder" in the PR description; in fact, Stockfish's original code uses __int128 which GCC 14 can also compile efficiently, but unfortunately this C syntax extension was disabled by SPEC (assembly comparison at Godbolt);
Stockfish::MovePicker::next_move(bool skipQuiets) from src/movepick.cpp: 10.36%, the slow part is partial_insertion_sort: after finding the insertion position, the subsequent array elements must be shifted to make room;
Stockfish::search(Position& pos, Stack* ss, Value alpha, Value beta, Depth depth, bool cutNode) from src/search.cpp: 9.49%, the main search logic is implemented here;
__popcountdi2 from libgcc: 7.52%, called by Stockfish::Eval::evaluate(const Position& pos) to determine board conditions using bit operations. Interested readers can refer to Hacker's Delight.

With -march=native enabled, __popcountdi2 is inlined as a popcnt instruction. Testing shows that enabling -mpopcnt alone reduces time from 47s to 44s, close to -march=native performance. Simply enabling the popcnt ISA extension and eliminating the __popcountdi2 function call overhead brings noticeable performance improvement.

Under -O3, 1to6_classical executes 531.8B instructions (instructions perf counter), with 135.7B Load instructions (mem_inst_retired.all_loads counter), 59.7B Stores (mem_inst_retired.all_stores counter), 56.0B branch instructions (branch-instructions counter), of which 2622.8M are mispredicted (branch-misses counter). The MPKI is quite high: 2622.8M/531.8B*1000=4.93. Even among SPEC INT 2017 benchmarks, this is higher than 531.deepsjeng_r's 3.16 and 557.xz_r's 3.49, but lower than 505.mcf_r's 6.24 and 541.leela_r's 7.71.

Using perf record -e branch-misses:pp, the main branch mispredictions come from Stockfish::MovePicker::next_move() contributing 27.48%, mainly from the insertion sort, i.e., finding the insertion position and shifting existing elements. Next is Stockfish::Eval::evaluate() at 17.42%, then Stockfish::search() at 13.06%.

With -O3 -mpopcnt, instruction count drops to 453.9B, with 124.2B Loads, 53.1B Stores, 46.1B branch instructions, and still 2.6B mispredictions. Just inlining the __popcountdi2 call saves 77.9B instructions, about 15% of the original. __popcountdi2 itself is 21 instructions, plus one jmp in __popcountdi2@plt, plus the call __popcountdi2@plt itself and register save/restore overhead.

2. 1to6_nnue¶

The latter two workloads switch from classical to nnue engine (involving neural networks), so the computation pattern is different. perf shows the main time-consuming functions for 1to6_nnue:

Stockfish::Eval::NNUE:evaluate(const Position& pos, bool adjusted) from src/nnue/evaluate_nnue.cpp: 80.59%, main time spent in affine_transform_non_ssse3's sum += weights[offset + j] * input[j], i.e., neural network inference. It computes int8_t multiplied by uint8_t, accumulated into int32_t result. Under default flags, only basic SSE instructions like pmaddwd/paddd can be used, not AVX;
Stockfish::TranspositionTable::probe(const Key key, bool& found) from src/tt.cpp: only 4.81%, same random memory access bottleneck as before.

Analyzing the Stockfish::Eval::NNUE:evaluate instructions: to implement the above logic, the core approach uses the pmaddwd instruction for 4 signed 16-bit multiplications accumulated into 32-bit results. But first, the 8-bit signed weights and unsigned input must be extended to signed 16-bit. Signed 8-bit weights extension is straightforward, while unsigned 8-bit input handling is complex. First, it adds 128 to each input element, then treats it as signed, effectively subtracting 128, mapping uint8_t to int8_t. This allows input to use the same sign extension method as weights. However, this introduces error in the result, so to correct the bias, 128 times the sum of weights is subtracted. Assembly code (Godbolt):

1: # Load 16 signed weights elements movdqu (%rdx,%rcx,1),%xmm2 movdqa %xmm5,%xmm8 # Load 16 unsigned input elements movdqa (%r12,%rcx,1),%xmm10 add $0x10,%rcx # Sign-extend weights pcmpgtb %xmm2,%xmm8 movdqa %xmm2,%xmm9 # Add 128 to each input element, i.e., subtract 128 to convert to signed int8_t paddb %xmm6, %xmm10 # Sign-extend weights punpckhbw %xmm8,%xmm2 punpcklbw %xmm8,%xmm9 movdqa %xmm2,%xmm11 movdqa %xmm9,%xmm8 # Compute weights sum times 128 pmaddwd %xmm3,%xmm11 pmaddwd %xmm7,%xmm8 paddd %xmm11,%xmm0 paddd %xmm8,%xmm0 paddd %xmm11,%xmm0 movdqa %xmm5,%xmm11 # Sign-extend input pcmpgtb %xmm10,%xmm11 paddd %xmm8,%xmm0 movdqa %xmm10,%xmm8 punpckhbw %xmm11,%xmm10 punpcklbw %xmm11,%xmm8 # Compute weights * input pmaddwd %xmm10,%xmm2 pmaddwd %xmm8,%xmm9 # Accumulate results paddd %xmm2,%xmm0 paddd %xmm9,%xmm0 cmp $0x400,%rcx jne 1b

For SIMD-friendly code like this, -march=native typically brings significant improvement, as confirmed by testing: time drops from 77s to 32s, Stockfish::Eval::NNUE::evaluate share drops to 54.20%, with the main computation instruction becoming the AVX-VNNI extension's vpdpbusd (Multiply and Add Unsigned and Signed Bytes), a fused integer multiply-add for byte elements (weights are int8_t, input are uint8_t), with int32_t accumulator. Core loop (Godbolt):

1: # Load unsigned input vmovdpa (%r8,%rcx,1),%ymm0 # Load signed weights and compute sum += weights[offset + j] * input[j] {vex} vpdpbusd (%rdx,%rcx,1),%ymm0,%ymm2 add $0x20,%rcx cmp $0x400,%rcx jne 1b

If the CPU supports AVX512-VNNI, this can be further widened to 512-bit: vpdpbusd (%rdx,%rax), %zmm1, %zmm0. Note that simply enabling -mavx2 only reduces time from 77s to 50s, still far from -march=native's 32s: even with AVX enabled (Godbolt), without AVX-VNNI the vpdpbusd instruction is unavailable, requiring format conversion to 16-bit followed by 16-bit integer multiply-add with 32-bit accumulator. Stockfish's NNUE computation is designed around the vpdpbusd instruction. CPUs lacking this instruction, or where the compiler doesn't utilize it, will see significantly lower performance.

On ARM64, the corresponding USDOT (Dot product with unsigned and signed integers (vector)) instruction is part of the i8mm extension. With this extension, -march=native provides significant improvement (Godbolt), e.g., Apple M2; without it, -march=native makes no difference, e.g., Apple M1, falling back to extend-to-16-bit-then-sum like AMD64 (Godbolt). RISC-V Vector extension has the vwmulsu.vv instruction, yielding 16-bit multiplication results, then vwadd.wv to accumulate to 32-bit (Godbolt). LoongArch also has corresponding xvmulwev.h.b/xvmulwod.h.b instructions yielding 16-bit results, then xvhaddw.w.h to accumulate to 32-bit (Godbolt), which can be further optimized using xvmulwev.h.bu.b, and the optimized transform function is 37% faster than GCC 16.

Beyond ISA extension enablement, GCC 15 shows notable performance improvement over GCC 14 on 1to6_nnue (with -O3), from 77s to 49s. Examining the generated instructions: although still using SSE, the instruction sequence is more concise (Godbolt):

# %xmm5 initialized to all zeros 1: # Load 16 signed weights elements movdqu (%rdx,%rcx,1),%xmm4 movdqa %xmm5,%xmm8 # Load 16 unsigned input elements movdqa (%r12,%rcx,1),%xmm2 add $0x10,%rcx # Compare weights with zero: non-negative gives 0, negative gives 0xFF pcmpgtb %xmm4,%xmm8 movdqa %xmm2,%xmm6 movdqa %xmm4,%xmm7 # Zero-extend input from 8-bit unsigned to 16-bit, saved in %xmm2 and %xmm6 punpckhbw %xmm5,%xmm2 punpcklbw %xmm5,%xmm6 # Combined with pcmpgtb above, sign-extend weights from 8-bit signed to 16-bit, saved in %xmm4 and %xmm7 punpckhbw %xmm8,%xmm4 punpcklbw %xmm8,%xmm7 # Each pmaddwd performs 4 times 16-bit * 16-bit + 16-bit * 16-bit = 32-bit # Two pmaddwd together complete 8 16-bit multiplications and 8 32-bit additions pmaddwd %xmm4,%xmm2 pmaddwd %xmm7,%xmm6 # Each paddd performs 4 32-bit accumulations paddd %xmm2,%xmm0 paddd %xmm6,%xmm0 cmp $0x400,%rcx jne 1b

Even without the dedicated vpdpbusd instruction, SSE-only optimization space remains. GCC 15 efficiently implements signed and unsigned sign extension via SSE, achieving performance between GCC 14's suboptimal instruction sequence and the dedicated vpdpbusd instruction. This is also mentioned in SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison: For example, gcc-15 reduces the instruction count of 706.stockfish_r by up to 3x, though that number is relative to GCC 13; the reduction vs. GCC 14 is less dramatic (see Figure 10 and Figure 16 in the paper). Measured here: from GCC 14's 1342B instructions down to GCC 15's 1015B. In comparison, LLVM 22's SSE (-O3, Godbolt) or AVX (-O3 -march=alderlake, Godbolt) sequences are less efficient than GCC 15.

Under -O3, 1to6_nnue executes 1342.1B instructions, with 182.2B Loads, 61.8B Stores, 229.1B 128-bit integer vector instructions (e.g., SSE, int_vec_retired.128bit counter), 77.6B branch instructions, with 1612.9M mispredictions. Its MPKI is only 1612.9M/1342.1B*1000=1.20; the main bottleneck is the neural network inference above.

GCC 15 under -O3: 1to6_nnue instruction count drops to 1015.3B, with 175.0B Loads, 57.8B Stores, only 97.0B 128-bit integer vector instructions, 77.4B branch instructions, showing significant optimization.

GCC 14 under -march=native: 1to6_nnue instruction count plummets to 446.8B (only one-third of the original), with 119.6B Loads, 44.4B Stores, 48.7B branch instructions, 13.2B 256-bit AVX VNNI instructions (int_vec_retired.vnni_256 counter), showing significant optimization.

3. 7to11_nnue¶

7to11_nnue behaves similarly to 1to6_nnue, with the bottleneck also in Stockfish::Eval::NNUE:evaluate. Enabling -march=native reduces time from 72s to 31s. GCC 15's improvement is also similar to 1to6_nnue, from 72s to 46s.

Under -O3, 7to11_nnue executes 1253.2B instructions, with 176.1B Loads, 61.6B Stores, 212.5B 128-bit integer vector instructions, 75.4B branch instructions, with 1547.5M mispredictions. Its MPKI is only 1547.5M/1253.2B*1000=1.23; the main bottleneck remains neural network inference.

GCC 15 under -O3: 7to11_nnue instruction count drops to 955.3B, with 169.4B Loads, 57.8B Stores, only 92.3B 128-bit integer vector instructions, 75.2B branch instructions, showing significant optimization.

GCC 14 under -march=native: 7to11_nnue instruction count plummets to 425.9B (only one-third), with 115.1B Loads, 43.7B Stores, 47.1B branch instructions, 12.0B 256-bit AVX VNNI instructions, showing significant optimization.

Summary¶

Performance under different compilation options:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispredictions (M)	MPKI	128-bit Int Vec (B)	256-bit Int Vec (B)
1. 1to6_classical	GCC 14 `-O3`	47	531.8	135.7	59.7	56.0	2622.8	4.93	0.13	0.00
1. 1to6_classical	GCC 14 `-O3 -mpopcnt`	44	453.9	124.2	53.1	46.1	2639.3	5.81	0.13	0.00
2. 1to6_nnue	GCC 14 `-O3`	77	1342.1	182.2	61.8	77.6	1612.9	1.20	229.1	0.00
2. 1to6_nnue	GCC 15 `-O3`	49	1015.3	175.0	57.8	77.4	1258.2	1.24	97.0	0.00
2. 1to6_nnue	GCC 14 `-march=native`	32	446.8	119.6	44.4	48.7	953.8	2.13	5.1	36.3
3. 7to11_nnue	GCC 14 `-O3`	72	1253.2	176.1	61.6	75.4	1547.5	1.23	212.5	0.00
3. 7to11_nnue	GCC 15 `-O3`	46	955.3	169.4	57.8	75.2	1224.7	1.28	92.3	0.00
3. 7to11_nnue	GCC 14 `-march=native`	31	425.9	115.1	43.7	47.1	922.9	2.17	4.6	35.0

1to6_classical resembles a traditional chess engine with complex branching and memory access, so its MPKI=4.93 is similar to SPEC CPU 2017's 531.deepsjeng_r (MPKI=3.16), falling in the higher category. Meanwhile, 1to6_nnue and 7to11_nnue are mainly bottlenecked by i8 matrix operations; whether hardware acceleration instructions (here AVX-VNNI) are available has a major performance impact, with branch prediction becoming much less significant. The overall average MPKI is 1.85, not particularly high.

707.ntest_r¶

ntest is an Othello (Reversi) engine. The benchmark includes:

ntest_r Othello.154.ggf 20 16

Measured runtime is 140s. The reftime is 592s, corresponding to 4.2 points. With various optimized flags: -O3 -flto vs. -O3 brings 4% improvement; further -O3 -flto -march=native vs. -O3 -flto brings another 10%. Below is detailed workload analysis. Othello rules are simple: you can only place a piece at an empty position if it flips at least one opponent's piece, otherwise you pass. The flipping rule: along all 8 directions (horizontal, vertical, diagonal), if all pieces between the new piece and another of your own pieces are opponent's pieces, they all get flipped. perf shows these high-time-share functions:

flips(int sq, u64 mover, u64 enemy) from src/flips.cpp: 34.80%, the main cost. Based on board state, through memory accesses and bit operations, it first checks neighbors[sq]&enemy for adjacent enemy pieces (none means cannot play), then computes which pieces get flipped. Mainly data-dependent memory accesses mixed with bit operations;
solveNParity(int alpha, int beta, u64 mover, u64 enemy, u64 parity, EndgameSearch* search, bool hasPassed) from src/solve.cpp: 14.21%, alpha-beta pruning minimax (negamax variant), iterating over empty positions. It first finds those with good parity (using bitSet() which uses AMD64's bt instruction, since in Othello the player making the last move gains an advantage, so it prioritizes positions giving the last move), calling flips() to check for flips, recursing if flips occur, then iterating again for bad parity positions. Main bottleneck is memory access and data-dependent branches;
__popcountdi2: 9.65%, without -mpopcnt/-march=native, needed for counting pieces of each color, etc.;
solveNFlipParity: 8.95%, works with solveNParity to complete the minimax algorithm;
solve2: 5.38%, part of the minimax algorithm, handling the final position with only two empty squares, where determining the winner is straightforward without further recursion.

This is a typical chess engine pattern: the entire minimax algorithm takes 70%+ of the time, with extensive bit operations and memory accesses for position searching, plus data-dependent branches. Indeed: 2688.3B instructions executed, with 647.8B Loads, 255.2B Stores, 228.2B branches, 6.1B mispredictions, MPKI reaching 6.1B/2688B*1000=2.27. Via perf record -e branch-misses:pp, solveNParity and solveNFlipParity together contribute 60.37% of mispredictions, mainly from the loop's good/bad parity checks and linked list insertion NULL checks, all data-dependent branches.

Similar to 706.stockfish_r, it has significant popcnt calls, so enabling -mpopcnt gives nice improvement: time drops from 140s to 126s (11% reduction), instructions reduce to 2286.9B with 586.9B Loads, 206.7B Stores, 187.6B branches. Even with -march=native, performance only further drops to 122s, with minimal AVX2 usage.

On the other hand, LLVM 22 is faster than GCC 14 on 707.ntest_r: with the same -O3 flags, runtime drops from GCC 14's 140s to 126s. Investigating the assembly reveals that LLVM 22, without -mpopcnt, directly inlines code similar to libgcc's __popcountdi2 into the program, saving the libgcc call overhead at the cost of larger code size, executing 2416.9B instructions with 542.7B Loads, 202.9B Stores, 168.2B branches. Similarly, 706.stockfish_r's 1to6_classical is also faster with LLVM 22 vs. GCC 14, from 47s to 44s.

Meanwhile, GCC 15 also improves over GCC 14, from 140s to 130s. Assembly analysis reveals the main optimization in flips(int sq, u64 mover, u64 enemy). Two performance differences:

Callee-saved register usage: GCC 14 performs a series of push/pop in prologue/epilogue unconditionally, while GCC 15 is smarter, only performing push/pop when if (neighbors[sq]&enemy) is true and the complex function body requiring callee-saved registers is needed, otherwise returning directly, since the condition check doesn't use callee-saved registers.
The self-compiled GCC 15 defaults to -no-pie mode while the distro's GCC 14 defaults to -pie. In -no-pie mode, absolute addresses allow memory operands directly in imul etc., saving registers and eliminating the need for callee-saved registers, removing push/pop overhead entirely. -static provides similar benefit. The first point was observed after manually adding -pie to GCC 15. The main performance gain comes from reducing push/pop execution count.

GCC 15's 707.ntest_r executes 2429.3B instructions with 610.9B Loads, 206.2B Stores, 224.7B branches. Results under different compilers and flags:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)
GCC 14 `-O3`	140	2688.3	647.8	255.2	228.2
GCC 14 `-O3 -flto`	134	2656.3	623.4	251.3	200.9
GCC 14 `-O3 -mpopcnt`	126	2286.9	586.9	206.7	187.6
GCC 14 `-O3 -march=native`	122	2230.0	588.2	206.4	185.2
LLVM 22 `-O3`	126	2416.9	542.7	202.9	168.2
GCC 15 `-O3`	130	2429.3	610.9	206.2	224.7

Combining 706.stockfish_r and 707.ntest_r shows that popcnt is quite commonly used. Unfortunately, the AMD64 baseline doesn't provide this instruction, so with x86-64-v2 or higher optimization flags, such applications can use a single popcnt instruction to eliminate the libgcc __popcountdi2 call overhead. Compared to AVX-VNNI, popcnt is far more widely available.

708.sqlite_r¶

sqlite is the famous database and needs no introduction. The benchmark includes three workloads:

# 1. main sqlite_r --memdb --size 2000 --testset main --verify # 2. cte sqlite_r --memdb --size 2000 --testset cte --verify # 3. fp sqlite_r --memdb --size 1000 --testset fp --verify

Measured times: 69s, 12s, and 25s respectively, totaling 106s. The reftime is 528s, corresponding to 5.0 points. Enabling -flto/-ljemalloc has minimal impact; -march=native even causes regression. Below is per-workload analysis.

1. main¶

perf hotspot functions:

sqlite3BtreeMovetoUnpacked(BtCursor *pCur, UnpackedRecord *pIdxKey, i64 intKey, int biasRight, int *pRes) from src/sqlite3.c: 24.66%, B-tree search for entries by key. A time-consuming part is byte-by-byte scanning of pCell memory, plus frequent sqlite3GetVarint calls to read variable-length ints for binary search;
sqlite3VdbeExec(Vdbe *p) from src/sqlite3.c: 22.36%, a Loop+Switch bytecode VM executing compiled SQL statements. VDBE (Virtual Database Engine) is SQLite's execution engine, maintaining a pc scanning bytecodes from the aOp array. Each bytecode is a struct VdbeOp; based on the opcode field, a large switch-case (176 different Ops) is performed. GCC compiles this into a jump table, storing each case's address in an array, computing the target from opcode, then jmp *%rax. Some interpreters use C extensions with computed goto labels, or further jump directly to the next opcode's case at each case's end. Further reading: Android Runtime Interpreter Implementation;
pcache1Fetch(sqlite3_pcache *p, unsigned int iKey, int createFlag) from src/sqlite3.c: 8.26%, a hash table Page Cache for caching disk data in memory, with main bottleneck in pcache1FetchNoMutex's pPage = pCache->apHash[iKey % pCache->nHash]; while( pPage && pPage->iKey!=iKey ){ pPage = pPage->pNext; }, scanning linked list in hash buckets with frequent random accesses;
sqlite3GetVarint(const unsigned char *p, u64 *v) from src/sqlite3.c: 3.70%, recovering variable-length integers from memory (e.g., [0,127] uses one byte, [128,16383] uses two bytes, up to nine bytes). This encoding is quite common and usually saves space.

Classic data structures: B-tree, Loop+Switch interpretation, and hash table lookup. An example VDBE instruction sequence:

sqlite> CREATE TABLE test(key INT, value INT); sqlite> EXPLAIN SELECT * FROM test WHERE key = 1; addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 10 0 0 Start at 10 1 OpenRead 0 2 0 2 0 root=2 iDb=0; test 2 Rewind 0 9 0 0 3 Column 0 0 1 0 r[1]= cursor 0 column 0 4 Ne 2 8 1 BINARY-8 84 if r[1]!=r[2] goto 8 5 Column 0 0 3 0 r[3]= cursor 0 column 0 6 Column 0 1 4 0 r[4]= cursor 0 column 1 7 ResultRow 3 2 0 0 output=r[3..4] 8 Next 0 3 0 1 9 Halt 0 0 0 0 10 Transaction 0 0 1 0 1 usesStmtJournal=0 11 Integer 1 2 0 0 r[2]=1 12 Goto 0 1 0 0

It scans every row of the test table, reads the key column, skips to the next row if not equal to 1; if equal, reads all columns and adds to results.

Main bottleneck is memory. 896.3B instructions executed, with 252.4B Loads, 105.1B Stores, 178.0B branches, 1.5B mispredictions, MPKI = 1.5B/896.3B*1000=1.67.

2. cte¶

perf hotspot functions:

sqlite3VdbeExec(Vdbe *p) from src/sqlite3.c: 41.15%, most time in query execution, since this cte workload has complex computations, implementing Sudoku (recursive and non-recursive), Mandelbrot, and testing EXCEPT SELECT syntax via SQL;
sqlite3VdbeRecordCompareWithSkip(int nKey1, const void *pKey1, UnpackedRecord *pPKey2, int bSkip) from src/sqlite3.c: 7.37%, comparing two rows, calling sqlite3VdbeSerialGet to retrieve data then comparing by type;
sqlite3VdbeSerialGet(const unsigned char *buf, u32 serial_type, Mem *pMem) from src/sqlite3.c: 5.95%, deserialization based on stored data type (integer or float), its switch-case also compiled into a jump table;
vdbeSorterSort(SortSubtask *pTask, SorterList *pList) from src/sqlite3.c: 5.95%, merge sort implementation, with main time in function pointer comparator calls and merging based on comparison results.

Bottleneck is mainly the interpreter, similar to CPython. 306.0B instructions, with 82.8B Loads, 39.6B Stores, 62.6B branches, 40.9M mispredictions, MPKI = 40.9M/306.0B*1000=0.13, very low.

3. fp¶

perf hotspot functions:

sqlite3VdbeExec(Vdbe *p) from src/sqlite3.c: 30.66%, query execution with significant floating-point operations in this fp workload;
sqlite3AtoF(const char *z, double *pResult, int length, u8 enc) from src/sqlite3.c: 19.18%, string-to-float conversion since the SQL contains many float literals;
vdbeSorterSort(SortSubtask *pTask, SorterList *pList) from src/sqlite3.c: 10.44%, see above;
sqlite3VdbeRecordCompareWithSkip(int nKey1, const void *pKey1, UnpackedRecord *pPKey2, int bSkip) from src/sqlite3.c: 6.76%, see above.

Bottleneck is mainly the interpreter, with significant time on string-to-float conversion due to SQL design. 554.7B instructions, 132.3B Loads, 61.3B Stores, 111.5B branches, 392.6M mispredictions, MPKI = 392.6M/554.7B*1000=0.71.

Summary¶

Results under different flags:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	MPKI
1. main	GCC 14 `-O3`	69	896.3	252.4	105.1	178.0	1.67
1. main	GCC 14 `-O3 -march=native`	73	905.3	273.7	109.9	177.2	1.62
2. cte	GCC 14 `-O3`	12	306.0	82.8	39.6	62.6	0.13
2. cte	GCC 14 `-O3 -march=native`	13	303.6	88.9	40.0	62.6	0.13
3. fp	GCC 14 `-O3`	25	554.7	132.3	61.3	111.5	0.71
3. fp	GCC 14 `-O3 -march=native`	27	555.8	142.7	62.6	111.6	0.69

As shown, sqlite_r is one of those hard-to-optimize benchmarks: heavy memory access, computation, and branching interleaved, heavy on the memory subsystem, hard to vectorize. -O3 -march=native actually increases runtime from 106s to 113s, a regression. Overall: 1760B instructions, 353B branches, MPKI only 1.08, mainly from main.

710.omnetpp_r¶

The familiar 520.omnetpp_r from SPEC INT 2017, but with different workloads. 520.omnetpp_r simulated a 10 Gbps network; 710.omnetpp_r has ten workloads, significantly more diverse:

omnetpp_r -f randomMesh.ini -c General omnetpp_r -f queuenet.ini -c OneFifo omnetpp_r -f queuenet.ini -c TandemFifos omnetpp_r -f queuenet.ini -c SmallCQN omnetpp_r -f queuenet.ini -c Ring omnetpp_r -f queuenet.ini -c Terminal omnetpp_r -f queuenet.ini -c CallCenter omnetpp_r -f queuenet.ini -c ForkJoin omnetpp_r -f queuenet.ini -c ResourceAllocation omnetpp_r -f queuenet.ini -c AllocDealloc

Measured times: 24.6s, 7.8s, 3.8s, 4.6s, 9.1s, 3.7s, 2.6s, 9.4s, 6.6s, and 14.0s, totaling 86.2s. The reftime is 486s, corresponding to 5.6 points.

1. randomMesh¶

Hotspot functions:

omnetpp::cTopology::calculateUnweightedSingleShortestPathsTo(Node *_target) from src/simulator/sim/ctopology.c: 16.22%, classic single-source shortest path (effectively BFS since all edges have unit weight), with bottleneck from random memory access and double-precision floating-point distance computation;
__do_dyncast and __dynamic_cast from libstdc++.so: 4.73%+3.24%+2.22%+0.81%=11.0%, some dynamic_cast usage, e.g., Routing::handleMessage;
Routing::handleMessage(cMessage *msg) from src/model/Routing.cc: 7.10%, simulating routing table, where main logic inlines a std::map<int, int> find operation (Godbolt), querying a red-black tree;
cEvent::shouldPrecede(const cEvent *other) from src/simulator/sim/cevent.cc: 4.64%, multi-key comparison of cEvent structs.

Overall, bottlenecks are spread across many locations. 306.4B instructions, 98.7B Loads, 50.2B Stores, 62.1B branches, 661.2M mispredictions, MPKI = 661.2M/306.4B*1000=2.16. With -O3 -flto, instructions drop to 284.6B (91.3B Loads, 45.4B Stores, 55.7B branches). Further with -O3 -flto -ljemalloc, instructions drop to 279.8B (90.3B Loads, 44.4B Stores, 54.3B branches).

randomMesh under different flags:

Compiler + Flags	Insns (B)	Load (B)	Store (B)	Branch (B)
GCC 14 `-O3`	306.4	98.7	50.2	62.1
GCC 14 `-O3 -flto`	284.6	91.3	45.4	55.7
GCC 14 `-O3 -flto -ljemalloc`	279.8	90.3	44.4	54.3

Remaining 2-10: 9 queuenet workloads¶

perf shows the remaining 9 queuenet workloads' bottlenecks concentrated in:

strcmp (__strcmp_avx2)
dynamic_cast (__do_dyncast and __dynamic_cast)
malloc, free, and operator new
printf (__printf_buffer)

Plus some omnetpp functions (e.g., omnetpp::common::StringPool::obtain(const char *s), mainly querying and modifying std::unordered_map<const char *,int,str_hash, str_eq> pool), scattered around with each under 5%. With such heavy libc/libstdc++ usage, standard library and memory allocator implementations become critical.

Summary¶

Based on the above analysis, different compiler flags were tested:

-O3 -ljemalloc: all ten workloads improve, total from 86.2s to 80.6s, score from 5.6 to 6.0.
-O3 -flto: total from 86.2s to 76.1s, score from 5.6 to 6.4.
-O3 -flto -ljemalloc: total from 86.2s to 69.7s, score from 5.6 to 7.0.

Similar patterns appeared in SPEC INT 2017: -O3 -flto was 3% faster than -O3; -O3 -flto -ljemalloc was 20% faster than -O3 -flto.

Under -O3, total instructions are 1447B, with 291B branches, MPKI = 0.78. Although randomMesh has high MPKI due to graph computation, the overall MPKI is dragged down by other workloads. In comparison, SPEC INT 2017 Rate's 520.omnetpp_r had MPKI of 4.33. Same framework, but workload behavior has changed significantly.

714.cpython_r¶

We just mentioned interpreters, and here comes CPython. The benchmark contains three workloads:

# 1. resnet cpython_r -I -B coreml_pb.py -i 2 -a -m Resnet50Headless.mlmodel -d 10 # 2. mobilenet cpython_r -I -B coreml_pb.py -i 5 -a -c -m MobileNetV2.mlmodel -d 20 # 3. dna cpython_r -I -B dna_bench.py 600000

Runtimes: 31s, 20s, and 20s, total 71s, reftime 479s, corresponding to 6.7 points. With -O3 -flto: 29s, 19s, and 18s, total 66s, 7.3 points. -O3 -ljemalloc has minimal impact; -O3 -march=native causes regression. Detailed analysis follows.

1. resnet¶

Hotspot functions via perf:

_PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int throwflag) from src/cpython/Python/ceval.c: 24.09%, the interpreter's Loop + Switch core, interpreting Python bytecode. Main bottleneck is the jump table (jmp *%rax based on opcode);
PyUnicode_FromFormatV(const char *format, va_list vargs) from src/cpython/Objects/unicodeobject.c: 4.51%, sprintf into Python string, with bottleneck in format string parsing, finding % positions;
_PyObject_Free(void *ctx, void *p) from src/cpython/Objects/obmalloc.c: 3.48%, freeing PyObject. Python has its own allocator for PyObjects rather than using malloc/free directly;
_PyObject_Malloc(void *ctx, size_t nbytes) from src/cpython/Objects/obmalloc.c: 3.15%, allocating PyObject.

The rest is scattered, mainly around the interpreter loop. 651.6B instructions, 180.4B Loads, 104.1B Stores, 136.6B branches, only 7.9M mispredictions, MPKI = 7.9M/651.6B*1000=0.01, negligible. With -O3 -flto: same hotspots, instructions drop to 618.0B (176.6B Loads, 93.9B Stores, 128.6B branches, 48.6M mispredictions).

2. mobilenet¶

Same top-four hotspots with similar proportions, likely because resnet and mobilenet use the same .py source, just different models. 438.9B instructions, 121.4B Loads, 70.5B Stores, 91.6B branches, 9.1M mispredictions, MPKI = 9.1M/438.9B*1000=0.02, negligible. With -O3 -flto: instructions drop to 416.4B (119.0B Loads, 63.8B Stores, 86.2B branches, 35.0M mispredictions).

3. dna¶

Hotspot functions:

_PyEval_EvalFrameDefault(...): 36.75%, see above;
_PyObject_Free(...): 5.31%, see above;
PyUnicode_Contains(PyObject *str, PyObject *substr) from src/cpython/Objects/unicodeobject.c: 4.59%, Python string contains operation, corresponding to char in "GATC" in data/all/input/knucleotide.py;
_PyObject_Malloc(...): 3.52%, see above.

Main hotspot remains interpretation, though PyUnicode_Contains is higher due to frequent string contains calls. 394.9B instructions, 113.3B Loads, 62.1B Stores, 77.1B branches, 228.1M mispredictions, MPKI = 228M/394B*1000=0.58, still very low. With -O3 -flto: 379.3B instructions (113.4B Loads, 58.5B Stores, 71.6B branches, 223.8M mispredictions).

Summary¶

Results under different flags:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispredictions (M)
1. resnet	GCC 14 `-O3`	31	651.6	180.4	104.1	136.6	7.9
1. resnet	GCC 14 `-O3 -flto`	29	618.0	176.6	93.9	128.6	48.6
2. mobilenet	GCC 14 `-O3`	20	438.9	121.4	70.5	91.6	9.1
2. mobilenet	GCC 14 `-O3 -flto`	19	416.4	119.0	63.8	86.2	35.0
3. dna	GCC 14 `-O3`	20	394.9	113.3	62.1	77.1	228.1
3. dna	GCC 14 `-O3 -flto`	18	379.3	113.4	58.5	71.6	223.8

714.cpython_r is a typical bytecode interpreter with Loop + Switch structure. Overall MPKI is very low at 0.17; even with -O3 -flto (more mispredictions but fewer total instructions, higher MPKI), the absolute number is still tiny at 0.23.

721.gcc_r¶

502.gcc_r existed in SPEC INT 2017 (based on GCC 4.5.0, compiling gcc-pp.c, gcc-smaller.c, and ref32.c five times each). This time, 721.gcc_r compiles the same three files once each (gcc-pp.c content updated, others unchanged), based on GCC 11.2.0, with simplified command lines:

# 1. gcc-pp cc1_r gcc-pp.c -O2 -fpic -o gcc-pp.c.opts-O2_-fpic.s # 2. gcc-smaller cc1_r gcc-smaller.c -O3 -fipa-pta -o gcc-smaller.c.opts-O3_-fipa-pta.s # 3. ref32 cc1_r ref32.c -O3 -finline-limit=12000 -fno-tree-vrp -o ref32.c.opts-O3_-finline-limit_12000_-fno-tree-vrp.s

-O3 runtimes: 44s, 21s, and 51s, total 116s, reftime 686s, corresponding to 5.9 points. -O3 -flto slightly reduces to 115s; -O3 -flto -ljemalloc further reduces to 111s, mainly targeting the ~2% time spent in malloc/free. -march=native has almost no impact.

Similar to 502.gcc_r (see The Alberta Workloads for the SPEC CPU 2017 Benchmark Suite analysis), 721.gcc_r's time is distributed across many functions. Except ref32 spending 10.76% in dominated_by_p and 5.92% in bitmap_set_bit, other functions are mostly under 3%, with no single dominant hotspot.

bitmap_set_bit(bitmap head, int bit) from src/gcc/bitmap.cc sets a bit in a bitmap using bit operations. Notably, this bitmap can be stored as either a splay tree or linked list. From perf record -e branch-misses:pp, this function's mispredictions mainly come from checking whether the bit is already set before writing. This saves some Store instructions but introduces branch mispredictions. Plus linked list insertion with NULL pointer checks.

dominated_by_p(enum cdi_direction dir, const_basic_block bb1, const_basic_block bb2) from src/gcc/dominance.cc performs basic block dominance queries (A dom B means all paths from entry to B pass through A), which is common in compilers. Due to frequent queries, two DFS passes precompute topological order, then dominance is checked via: DFS_Number_In(A) <= DFS_Number_In(B) && DFS_Number_Out(A) >= DFS_Number_Out(B). The function is simple with precomputed DFS results, but combining two comparisons into cmp+jl and cmp+setle causes branch mispredictions. The && short-circuit means the second condition (with two memory accesses) theoretically shouldn't execute if the first fails. Rewriting to perform both comparisons then AND would eliminate branches but increase memory accesses: Godbolt.

Performance counters for the three workloads:

gcc-pp: 470.2B instructions, 125.6B Loads, 58.8B Stores, 99.9B branches, 2.2B mispredictions, MPKI = 2.2B/470.2B*1000=4.68
gcc-smaller: 243.4B instructions, 65.0B Loads, 30.3B Stores, 51.8B branches, 0.91B mispredictions, MPKI = 0.91B/243.4B*1000=3.74
ref32: 403.7B instructions, 118.9B Loads, 45.8B Stores, 86.1B branches, 0.61B mispredictions, MPKI = 0.61B/403.7B*1000=1.51

Results:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (B)	MPKI
1. gcc-pp	GCC 14 `-O3`	44	470.2	125.6	58.8	99.9	2.2	4.68
1. gcc-pp	GCC 14 `-O3 -ljemalloc`	42	467.2	125.2	58.7	98.5	2.2	4.71
2. gcc-smaller	GCC 14 `-O3`	21	243.2	65.0	30.3	51.8	0.91	3.74
2. gcc-smaller	GCC 14 `-O3 -ljemalloc`	21	242.1	64.7	30.2	51.2	0.90	3.72
3. ref32	GCC 14 `-O3`	51	403.8	118.9	45.8	86.1	0.61	1.51
3. ref32	GCC 14 `-O3 -ljemalloc`	49	405.2	119.4	46.2	85.8	0.61	1.51

Overall 1120B instructions, 238B branches, MPKI = 3.37, quite high for SPEC INT 2026. For comparison, SPEC INT 2017 Rate's 502.gcc_r had MPKI of 3.13, not much different.

Unsurprisingly, 721.gcc_r compiled with GCC 14 runs faster than when compiled with LLVM 22.

723.llvm_r¶

With LLVM's growth, SPEC CPU 2026 finally includes it. Similar to 721.gcc_r, it runs the LLVM optimizer but with .bc IR files as input rather than C source. Two workloads:

# 1. transformsplus llvm-opt_r transformsplus.bc -S -O3 -mcpu=pwr9 # 2. codegen llvm-opt_r codegen.bc -S -O3 -mcpu=pwr9

-O3 runtimes: 62s and 53s, total 115s, reftime 507s, corresponding to 4.4 points. -O3 -flto actually regresses, but -O3 -ljemalloc gives significant improvement: 59s and 47s, total 106s, 4.8 points. -march=native has almost no impact.

Interestingly, 723.llvm_r compiled with GCC 14 runs faster than with LLVM 22, though the advantage is small. Detailed analysis follows.

1. transformsplus¶

perf hotspots:

llvm::InstCombinerImpl::foldIntegerTypedPHI(llvm::PHINode& PN) from src/lib/Transforms/InstCombine/InstCombinePHI.cpp: 4.06%, processing PHI nodes in IR, with main bottleneck in inner loop traversing use chains with random memory access and LLVM's custom RTTI type checks via branches;
_int_malloc/cfree/malloc: 2.38%+0.89%+0.82%=4.09%, heavy allocation/deallocation, hence -ljemalloc helps;
llvm::DenseMapBase::FindAndConstruct(): 1.69%, LLVM's array-based hash table, with bottleneck in reading hash bucket entries and comparing keys (random access). Recently LLVM has been optimizing this.

Many other small functions with low individual share; time is spread widely like 721.gcc_r. 572.8B instructions, 137.7B Loads, 78.6B Stores, 118.7B branches, 3.5B mispredictions, MPKI = 3.5B/572.8B*1000=6.11, quite high.

From perf record -e branch-misses:pp, mispredictions are spread across many functions. Top-down analysis shows 40% Frontend Bound, 19.2% Bad Speculation. Further analysis reveals L1 ICache misses at 12.6B (L1-icache-load-misses counter), giving L1IC MPKI of 12.6B/572.8B*1000=22.0. The main issue is 723.llvm_r's code size being too large for L1IC, and BTB is likely also strained.

2. codegen¶

perf hotspots:

llvm::InstCombinerImpl::foldIntegerTypedPHI(llvm::PHINode& PN): 20.85%, see above;
_int_malloc/cfree/malloc: 1.91%+0.72%+0.65%=3.28%, see above;
llvm::DenseMapBase::FindAndConstruct(): 1.29%, see above.

Overall similar to transformsplus, with foldIntegerTypedPHI taking a larger share. 415.9B instructions, 100.4B Loads, 57.5B Stores, 86.0B branches, 2.4B mispredictions, MPKI = 2.4B/415.9B*1000=5.77, still high.

Summary¶

Results:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (B)	MPKI
1. transformsplus	GCC 14 `-O3`	62	572.8	137.7	78.6	118.7	3.5	6.11
1. transformsplus	GCC 14 `-O3 -ljemalloc`	59	563.2	135.7	77.2	115.2	3.3	5.86
2. codegen	GCC 14 `-O3`	53	415.9	100.4	57.5	86.0	2.4	5.77
2. codegen	GCC 14 `-O3 -ljemalloc`	47	411.0	99.3	56.6	84.1	2.3	5.60

LLVM and GCC, twin stars of the compiler world, share similar workload characteristics: heavy memory allocation/deallocation benefiting from -ljemalloc; time spread across many small functions with no dominant hotspot; high MPKI. 723.llvm_r becomes the highest-MPKI benchmark in SPEC INT 2026 Rate at 5.98, likely due to its many data-dependent branches. Overall 991B instructions, 205B branches. Even in SPEC INT 2017 Rate, it would follow closely behind 505.mcf_r and 541.leela_r as the third-highest MPKI.

727.cppcheck_r¶

cppcheck is a C++ static analysis tool that reports issues like array out-of-bounds or uninitialized variables. It analyzes three different codes, seemingly sourced from other benchmarks. 747.dealii (became part of 766.femflow_r) and 770.7z aren't in SPEC CPU 2026 (not selected); only 738 diamond remains as 838.diamond_s:

# 1. 738_diamond cppcheck_r --force 738-diamond-record.cpp --checkers-report=738_report.txt --enable=all --output-file=738_bogey.txt # 2. 747_dealii cppcheck_r --force 747-dealii-data_out_base.cc --checkers-report=747_report.txt --enable=all --output-file=747_bogey.txt # 3. 770_7z cppcheck_r --force 770-7z-SystemPage.cpp --checkers-report=770_report.txt --output-file=770_bogey.txt

Runtimes: 27s, 22s, and 33s, total 82s, reftime 359s, corresponding to 4.4 points. -O3 -flto or -O3 -march=native only improve ~1%, but -O3 -ljemalloc significantly improves to 24s, 18s, and 29s, total 71s, 5.1 points.

1. 738_diamond¶

Hotspot functions:

multiCompareImpl(const Token *tok, const char *haystack, nonneg int varid) from src/lib/token.cpp: 40.82%, string matching, matching a token against abc|def by comparing characters, skipping to next | when no match;
Token::Match(const Token *tok, const char pattern[], nonneg int varid) from src/lib/token.cpp: 12.08%, similar string matching with different syntax (like a custom regex subset), calling multiCompareImpl for partial matching;
ScopeInfo3::findScope(const std::string & scope) from src/lib/tokenize.cpp: 5.49%, searching for symbols starting from current scope upward, with main time in std::list traversal and std::string comparison;
Tokenizer::simplifyUsing(): 3.57%, transforms using N::x; to using x = N::x using Token::Match with patterns like "using ::| %name% ::";
cfree/malloc/_int_malloc: 0.47%+0.33%+0.45%=1.25%.

Main bottleneck is string matching with a simple loop-based implementation, no data structure optimization. 399.9B instructions, 81.2B Loads, 35.5B Stores, 108.9B branches, 173.2M mispredictions, MPKI = 173M/399.9B*1000=0.43, not high.

2. 747_dealii¶

Similar hotspots:

multiCompareImpl(...): 27.42%;
Token::Match(...): 14.55%;
cfree/malloc/_int_malloc: 2.14%+1.57%+0.53%=4.24%, higher allocation share;
Token::simpleMatch(const Token *tok, const char pattern[], size_t pattern_len) from src/lib/token.cpp: 3.88%, another string matching function with different format (e.g., "abc def" means match abc or def), bottleneck in strncmp and memchr;
TemplateSimplifier::addInstantiation(Token *token, const std::string &scope) from src/lib/templatesimplifier.cpp: 2.98%, token-level code transformations, main time in std::list traversal;
isAliasOf(const Token* tok, const Token* expr, int* indirect, bool* inconclusive) from src/lib/astutils.cpp: 2.55%, alias checking.

Lots of string matching with multiple syntax variants and separate implementations; unclear why. 303.9B instructions, 67.3B Loads, 31.5B Stores, 82.5B branches, 298.9M mispredictions, MPKI = 298.9M/303.9B*1000=0.98.

3. 770_7z¶

Hotspots:

multiCompareImpl(...): 32.25%;
Token::Match(...): 18.82%;
__memcmp_avx2_movbe: 8.99%, used for string matching;
std::map<std::string>::equal_range: 7.34%, red-black tree queries plus string matching;
__strchr_avx2: 7.34%, used for string matching;
cfree/malloc/_int_malloc: 0.37%+0.27%+0.17%=0.81%.

Still string-matching dominated. 505.2B instructions, 111.0B Loads, 43.8B Stores, 137.5B branches, 421.0M mispredictions, MPKI = 421M/505.2B*1000=0.83.

Summary¶

Overall, 727.cppcheck_r is constantly doing string matching. A question worth pondering: why not tokenize into numeric IDs for faster comparison? Operating at the token level with string comparisons means the bottleneck is either in cppcheck's own string comparison or libc's.

Results:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (M)	MPKI
1. 738_diamond	GCC 14 `-O3`	27	399.9	81.2	35.5	108.9	173.2	0.43
1. 738_diamond	GCC 14 `-O3 -ljemalloc`	24	395.0	80.2	34.7	107.5	171.8	0.43
2. 747_dealii	GCC 14 `-O3`	22	303.9	67.3	31.5	82.5	298.9	0.98
2. 747_dealii	GCC 14 `-O3 -ljemalloc`	18	291.0	64.5	29.2	79.0	287.3	0.99
3. 770_7z	GCC 14 `-O3`	33	505.2	111.0	43.8	137.5	421.0	0.83
3. 770_7z	GCC 14 `-O3 -ljemalloc`	29	501.5	110.1	43.2	136.6	409.8	0.82

Overall 1211B instructions, 329B branches; branches account for 27%, the highest in SPEC INT 2026 Rate, all thanks to string matching (read a bit, compare a bit). Yet MPKI is only 0.71, third-lowest in SPEC INT 2026 Rate (above only 714.cpython_r's 0.17 and 750.sealcrypto_r's 0.14), meaning most string matching results are highly predictable (e.g., mismatch at the first byte).

729.abc_r¶

abc is an EDA tool (first encountered through yosys), along with 734.vpr_r, both heavyweight open-source EDA tools implementing logic synthesis and place-and-route respectively. Six workloads:

# 1. twoexact ./abc_r -F twoexact.in # 2. beem6 ./abc_r -F beem6-fraig.in # 3. mem ./abc_r -F mem_ctrl.in # 4. vga ./abc_r -F vga_lcd_miter.in # 5. mcml ./abc_r -F mcml.in # 6. des ./abc_r -F des_system90.in

Runtimes: 6.3s, 10.1s, 13.5s, 32.3s, 13.6s, and 17.0s, total 92.8s, reftime 459s, corresponding to 4.9 points.

Enabling -flto, -march=native, or -ljemalloc provides negligible improvement (within 1%), impervious to all optimizations. Detailed analysis follows.

1. twoexact¶

Hotspot functions:

sat_solver_propagate(sat_solver* s) from src/berkeley-abc/src/sat/bsat/satSolver.c: 75.33%, SAT Solver's Unit Propagation, finding clauses with only one undetermined variable, assigning it, then propagating;
sat_solver_analyze(sat_solver* s, int h, veci* learnt) from src/berkeley-abc/src/sat/bsat/satSolver: 15.85%, conflict analysis as part of CDCL (Conflict Driven Clause Learning);
sat_solver_solve_internal(sat_solver* s) from src/berkeley-abc/src/sat/bsat/satSolver.c: 3.80%, SAT Solver entry point.

Rarely see such concentrated bottlenecks, but indeed, SAT Solvers spend most time in Unit Propagation and CDCL on conflicts. Reminds me of writing a DPLL SAT Solver for a Software Analysis and Verification course long ago. Main bottleneck: memory accesses and data-dependent branches searching the SAT problem's solution space.

53.2B instructions, 13.8B Loads, 3.2B Stores, 8.4B branches, 606.2M mispredictions, MPKI = 606.2M/53.2B*1000=11.39, very high, approaching SPEC INT 2017's 541.leela_r.

Via perf record -e branch-misses:pp, main mispredictions come from sat_solver_propagate's variable value checks, all data-dependent and hard to predict.

2. beem6¶

Hotspot functions:

Cec4_ManPackAddPatterns(Gia_Man_t * p, int iBit, Vec_Int_t * vLits) from src/berkeley-abc/src/proof/cec/cecSatG2.c: 54.65%, CEC (Combinational Equivalence Checking), inner loop iterating vLits entries, updating p->vSims via bit operations;
Cec4_ManGeneratePatterns_rec(Gia_Man_t * p, Gia_Obj_t * pObj, int Value, Vec_Int_t * vPat, Vec_Int_t * vVisit) from src/berkeley-abc/src/proof/cec/cecSatG2.c: 29.01%, recursive processing by pObj type.

Still concentrated hotspots. 255.5B instructions, 57.2B Loads, 7.3B Stores, 40.3B branches, 192.0M mispredictions, MPKI = 192.0M/255.5B*1000=0.75, much lower than SAT.

3. mem¶

Hotspots are still SAT solver-related. Compared to twoexact, sat_solver_canceluntil is higher at 8.46%, but overall characteristics are the same. 151.0B instructions, 43.4B Loads, 15.4B Stores, 24.2B branches, 1213.7M mispredictions, MPKI = 1213.7M/151.0B*1000=8.03, very high.

4. vga¶

Still SAT solver dominated. 490.0B instructions, 143.9B Loads, 54.4B Stores, 76.9B branches, 2092.8M mispredictions, MPKI = 2092.8M/490B*1000=4.27, still high.

5. mcml¶

New hotspot functions appear:

Abc_ObjDeleteFanin(Abc_Obj_t * pObj, Abc_Obj_t * pFanin) from src/berkeley-abc/src/base/abc/abcFanio.c: 12.57%, calls Vec_IntRemove to delete an element by scanning the array and shifting subsequent elements;
Gia_ManSwiSimulate(Gia_Man_t * pAig, Gia_ParSwi_t * pPars) from src/berkeley-abc/src/aig/gia/giaSwitch.c: 8.87%, simulation with significant time in a custom popcount function Gia_WordCountOnes (not recognized as popcnt, using SSE vector software popcount);
Abc_AigAndLookup(Abc_Aig_t * pMan, Abc_Obj_t * p0, Abc_Obj_t * p1) from src/berkeley-abc/src/base/abc/abcAig.c: 7.03%, computing p0 AND p1 with special cases, then hash table linked list traversal with multi-level pointer access: pObj->pNtk->vObjs->pArray;
If_ObjPerformMappingAnd(If_Man_t * p, If_Obj_t * pObj, int Mode, int fPreprocess, int fFirst) from src/map/if/ifMap.c: 6.72%, also significant time in software popcount If_WordCountOnes;
Lpk_NodeCutsOneFilter(Lpk_Cut_t * pCuts, int nCuts, Lpk_Cut_t * pCutNew) from src/berkeley-abc/src/opt/lpk/lpkCut.c: 5.47%, bottleneck in data-dependent comparison branches.

208.0B instructions, 50.1B Loads, 15.4B Stores, 39.8B branches, 534.8M mispredictions, MPKI = 534.8M/208.0B*1000=2.57.

6. des¶

New hotspot functions again:

__strcmp_avx2 from libc: 22.04%, unexpectedly bottlenecked on strcmp again;
Nm_ManTableLookupId(Nm_Man_t * p, int ObjId) from src/misc/nm/nmTable.c: 21.56%, traversing a hash table with chained linked lists;
Nm_ManTableAdd(Nm_Man_t * p, Nm_Entry_t * pEntry) from src/misc/nm/nmTable.c: 12.19%, classic hash table insertion;
Nm_ManTableLookupName(Nm_Man_t * p, char * pName, int Type) from src/misc/nm/nmTable.c: 5.78%, hash table lookup using string matching, explaining the high strcmp count;
Gia_ManSwiSimulate from src/aig/gia/giaSwitch.c: 5.49%, see above;
spec_qsort: 3.98%, familiar from SPEC INT 2017's 505.mcf_r (where qsort bottleneck came from function pointer comparator calls; -flto inlining the pointer gave 13% improvement).

Classic hash table with string matching; bottleneck in hash table queries with poor spatial locality for linked list access.

135.7B instructions, 29.7B Loads, 11.5B Stores, 23.3B branches, 372.9M mispredictions, MPKI = 372.9M/135.7B*1000=2.75. Mispredictions mainly from __strcmp_avx2 and spec_qsort.

Summary¶

Results:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (M)	MPKI
1. twoexact	GCC 14 `-O3`	6.3	53.2	13.8	3.2	8.4	606.2	11.39
2. beem6	GCC 14 `-O3`	10.1	255.5	57.2	7.3	40.3	192.0	0.75
3. mem	GCC 14 `-O3`	13.5	151.0	43.4	15.4	24.2	1213.7	8.03
4. vga	GCC 14 `-O3`	32.3	490.0	143.9	54.4	76.9	2092.8	4.27
5. mcml	GCC 14 `-O3`	13.6	208.0	50.1	15.4	39.8	534.8	2.57
6. des	GCC 14 `-O3`	17.0	135.7	29.7	11.5	23.3	372.9	2.75

The six workloads touch different abc code paths: SAT, various EDA logic, and hash table lookups with string matching. SAT dominates the weight, giving overall MPKI of 3.87, second only to 723.llvm_r in SPEC INT 2026 Rate, exceeding 721.gcc_r and 777.zstd_r.

734.vpr_r¶

Next comes EDA's next step: after logic synthesis, place-and-route, which is what vpr_r does. Four workloads:

# 1. jpeg_place vpr stratixiv_arch.timing.xml JPEG_stratixiv_arch_timing.blif --RL_agent_placement off --place_algorithm bounding_box --max_criticality 0.0 --init_t 512 --alpha_t 0.75 --exit_t 1 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file JPEG_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --inner_num 4 --read_initial_place_file ref_JPEG_stratixiv_arch_timing.init.place --place # 2. jpeg_route vpr stratixiv_arch.timing.xml JPEG_stratixiv_arch_timing.blif --place_algorithm bounding_box --place_static_notiming_move_prob 50 25 25 --max_criticality 0.0 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file JPEG_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --place_file ref_JPEG_stratixiv_arch_timing.place --analysis --route # 3. smithwaterman_place vpr stratixiv_arch.timing.xml smithwaterman_stratixiv_arch_timing.blif --RL_agent_placement off --place_algorithm bounding_box --max_criticality 0.0 --init_t 512 --alpha_t 0.75 --exit_t 1 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file smithwaterman_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --inner_num 1.8 --read_initial_place_file ref_smithwaterman_stratixiv_arch_timing.init.place --place # 4. smithwaterman_route vpr stratixiv_arch.timing.xml smithwaterman_stratixiv_arch_timing.blif --place_algorithm bounding_box --place_static_notiming_move_prob 50 25 25 --max_criticality 0.0 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file smithwaterman_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --place_file ref_smithwaterman_stratixiv_arch_timing.place --analysis --route

The Stratix IV here is the classic Altera FPGA, now a relic of its era. Runtimes: 21s, 29s, 18s, and 19s, total 87s, reftime 461s, 5.3 points. With -O3 -flto: 19s, 25s, 17s, 17s, total 78s, 5.9 points, significant. Further with -O3 -flto -ljemalloc: 17s, 24s, 15s, 16s, total 72s, 6.4 points, 20% over -O3. -march=native adds less than 1%.

1. jpeg_place and 3. smithwaterman_place¶

Both perform placement, analyzed together. Similar hotspots:

get_non_updateable_bb(ClusterNetId net_id, t_bb* bb_coord_new) from src/vtr-vpr/vpr/src/place/place.cpp: jpeg_place 13.98%, smithwaterman_place 18.26%, iterating pins to find bounding box (xmin/xmax/ymin/ymax) by reading x and y coordinates;
try_swap(...) from src/vtr-vpr/vpr/src/place/place.cpp: jpeg_place 12.39%, smithwaterman_place 11.46%, selecting a block to move to an empty position or swap with another, evaluating cost;
physical_tile_type(ClusterBlockId blk) from src/vtr-vpr/vpr/src/util/vpr_utils.cpp: jpeg_place 7.59%, smithwaterman_place 7.75%, indirect indexed memory access, reading coordinates from block_loc, then reading type from grid;
get_bb_from_scratch(ClusterNetId net_id, t_bb* coords, t_bb* num_on_edges) from src/vtr-vpr/vpr/src/place/place.cpp: jpeg_place 6.73%, smithwaterman_place 2.78%, similar bounding box computation;
malloc/_int_malloc/cfree: jpeg_place 3.94%, smithwaterman_place 4.29%.

With -O3 -flto, physical_tile_type gets inlined, saving frequent function call overhead. Given the memory allocation share, -O3 -ljemalloc improvement is expected.

Under -O3: jpeg_place executes 273.7B instructions (84.5B Loads, 26.9B Stores, 51.9B branches, 781.0M mispredictions, MPKI=2.85). smithwaterman_place: 245.0B instructions (76.4B Loads, 24.7B Stores, 45.4B branches, 661.9M mispredictions, MPKI=2.70). Some cmov instructions visible in bounding box min/max computation; on ISAs without cmov, MPKI could be even higher.

2. jpeg_route and 4. smithwaterman_route¶

Routing hotspots differ:

ConnectionRouter<BinaryHeap>::evaluate_timing_driven_node_costs(...): jpeg_route 9.35%, smithwaterman_route 6.91%, computing cost with floating-point;
ConnectionRouter<BinaryHeap>::timing_driven_add_to_heap(...): jpeg_route 9.34%, smithwaterman_route 6.82%, computing cost then inserting into Binary Heap;
ConnectionRouter<BinaryHeap>::timing_driven_expand_neighbours(...): jpeg_route 8.14%, smithwaterman_route 4.00%, expanding neighbor nodes into heap;
ClassicLookahead::get_expected_delay_and_cong(...): jpeg_route 7.86%, smithwaterman_route 5.14%, delay and congestion estimation with floating-point;
BinaryHeap::get_heap_head(): jpeg_route 3.14%, smithwaterman_route 1.64%, classic min binary heap with float comparison;
malloc/_int_malloc/cfree: jpeg_route 2.90%, smithwaterman_route 4.19%.

Looks like cost computation with BinaryHeap selecting minimum cost for expansion, similar to search algorithms.

With -O3 -flto, evaluate_timing_driven_node_costs and timing_driven_add_to_heap are inlined into timing_driven_expand_neighbours. Given the allocation share, -O3 -ljemalloc improvement is expected.

Under -O3: jpeg_route executes 424.1B instructions (130.6B Loads, 50.6B Stores, 79.0B branches, 1094.2M mispredictions, MPKI=2.58). smithwaterman_route: 305.8B instructions (91.0B Loads, 36.0B Stores, 59.4B branches, 609.3M mispredictions, MPKI=1.99).

Summary¶

Results:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (M)	MPKI
1. jpeg_place	GCC 14 `-O3`	21	273.7	84.5	26.9	51.9	781.0	2.85
1. jpeg_place	GCC 14 `-O3 -flto`	19	247.0	69.2	22.2	47.8	774.2	3.13
1. jpeg_place	GCC 14 `-O3 -ljemalloc`	19	261.5	81.9	25.1	47.9	764.5	2.92
2. jpeg_route	GCC 14 `-O3`	29	424.1	130.6	50.6	79.0	1094.2	2.58
2. jpeg_route	GCC 14 `-O3 -flto`	26	356.6	103.2	33.5	66.3	1075.5	3.02
2. jpeg_route	GCC 14 `-O3 -ljemalloc`	28	411.5	127.9	48.8	74.9	1080.0	2.62
3. smithwaterman_place	GCC 14 `-O3`	18	245.0	76.4	24.7	45.4	661.9	2.70
3. smithwaterman_place	GCC 14 `-O3 -flto`	17	222.1	63.1	20.8	21.8	662.7	2.98
3. smithwaterman_place	GCC 14 `-O3 -ljemalloc`	17	232.9	73.8	23.0	41.4	648.7	2.78
4. smithwaterman_route	GCC 14 `-O3`	19	305.8	91.0	36.0	59.4	609.3	1.99
4. smithwaterman_route	GCC 14 `-O3 -flto`	17	264.3	72.9	25.5	51.5	590.9	2.24
4. smithwaterman_route	GCC 14 `-O3 -ljemalloc`	18	293.6	88.4	34.2	55.3	594.7	2.03

734.vpr_r splits into place (bounding box computation) and route (search and optimization). -flto and -ljemalloc provide significant gains via inlining hotspots and faster allocation. Overall 1254B instructions, 237B branches, MPKI = 2.51, in the upper-middle range.

735.gem5_r¶

gem5 is the well-known simulator; running SPEC CPU 2017 in GEM5 sustained many PhDs. Now the loop is complete: running SPEC INT 2026's GEM5 inside GEM5. Of course, 735.gem5_r's workload isn't SPEC CPU 2026 (no turtles all the way down), but RISC-V Linux kernel boot and memory access sequence generation. Four workloads:

# 1. o3 gem5sim --stats-file=run_riscv_boot.py_o3_10_--max-ticks_10_000_000_000_stats.stats.txt run_riscv_boot.py o3 10 --max-ticks 10_000_000_000 # 2. timing gem5sim --stats-file=run_riscv_boot.py_timing_4_--max-ticks_20_000_000_000.stats.txt run_riscv_boot.py timing 4 --max-ticks 20_000_000_000 # 3. traffic_21 gem5sim --stats-file=synthetic_traffic.py_LinearGenerator_21.stats.txt synthetic_traffic.py LinearGenerator 21 # 4. traffic_74_ruby gem5sim --stats-file=synthetic_traffic.py_LinearGenerator_74_--ruby.stats.txt synthetic_traffic.py LinearGenerator 74 --ruby

Runtimes: 16s, 21s, 21s, and 31s, total 89s, reftime 487s, 5.4 points. Optimization effects:

-O3 -flto: 15s, 20s, 20s, 29s, total 84s, 5.8 points (+6%).
-O3 -flto -ljemalloc: 14s, 18s, 16s, 26s, total 74s, 6.6 points (+20%).
-O3 -march=native -flto -ljemalloc: 12s, 18s, 16s, 26s, total 72s, 6.8 points (+24%). Only the first workload benefits from -march=native.

Given these improvements, we can already guess what bottlenecks we'll find.

1. o3¶

First workload simulates RISC-V Linux boot with O3 CPU. Hotspots:

malloc/_int_malloc/cfree/_int_free_chunk/operator new: 4.78%+3.46%+3.29%+1.35%+1.16%=13.29%, an incredible ratio, but gem5 indeed allocates heavily (e.g., Packet objects);
gem5::TimeBuffer<*>::advance() from src/gem5/cpu/timebuf.hh: 3.05%+2.43%+2.39%+2.28%+1.98%=12.13%, passing data between pipeline stages via rolling time windows, with main time in rep stos or SSE movups memory initialization, plus constructor/destructor with reference counting;
gem5::o3::IEW::tick() from src/gem5/cpu/o3/iew.cc: 3.32%, Issue-Execute-Writeback timing simulation, bottleneck mainly rep stos for data initialization.

Many other scattered small functions. With -O3 -flto, hotspots become one large fused function at 20.80% (the tick() lambda). With -O3 -flto -ljemalloc, allocation drops to 4.67%. -march=native replaces rep stos with AVX2 memset, optimizing TimeBuffer::advance().

Under -O3: 211.1B instructions, 69.9B Loads, 31.7B Stores, 43.2B branches, 175.5M mispredictions, MPKI = 175.5M/211.1B*1000=0.83.

2. timing¶

Second workload uses TimingSimpleCPU (much less complex than O3). Bottleneck shifts to RISC-V architecture code, cache simulation, and allocation:

cfree/malloc/operator new: 12.03%;
gem5::RiscvISA::Decoder::decode(...): 8.97%, RISC-V instruction decode (partially auto-generated) with std::map-based decode cache;
gem5::BaseTags::findBlock(...): 5.19%, set-associative tag comparison;
gem5::PMAChecker::check(...): 4.86%, RISC-V PMA check;
gem5::RiscvISA::ISA::readMiscReg(...): 3.34%, CSR read;
gem5::BaseCache::access(...): 2.84%, cache access simulation;
gem5::PMP::pmpCheck(...): 2.66%, RISC-V PMP check.

With -O3 -flto, readMiscReg is inlined. With -O3 -flto -ljemalloc, allocation drops to 5.82%.

Under -O3: 333.9B instructions, 113.9B Loads, 57.8B Stores, 69.8B branches, 202.9M mispredictions, MPKI = 202.9M/333.9B*1000=0.61.

3. traffic_21¶

Hotspots:

cfree/malloc/operator new: 13.47%;
gem5::SnoopFilter::lookupRequest(...): 5.93%, snoop filtering on bus using std::map;
gem5::AddrRange::removeIntlvBits(...): 3.39%, address interleaving bit removal, with bottleneck in ctz64() (GCC 14 generates loop, GCC 15 generates rep bsfq, with -mbmi generates tzcnt, Godbolt);
gem5::BaseTags::findBlock(...): 3.18%.

With -O3 -flto, removeIntlvBits disappears; with -ljemalloc, allocation drops to 5.47%.

Under -O3: 226.4B instructions, 65.5B Loads, 31.3B Stores, 50.8B branches, 749.3M mispredictions, MPKI = 749.3M/226.4B*1000=3.31, noticeably higher.

4. traffic_74_ruby¶

With ruby enabled, bottlenecks shift to gem5::ruby components:

cfree/malloc/operator new: 10.22%;
gem5::ruby::Cache_Controller::processNextState(...): 4.44%, cache state machine;
gem5::ruby::NetDest::intersectionIsNotEmpty(...): 4.03%, bitset AND operations;
gem5::ruby::MessageBuffer::isReady(...): 3.94%, message queue;
gem5::ruby::Cache_Controller::getDirEntry(...): 3.80%, std::map lookup.

With -O3 -flto, intersectionIsNotEmpty inlined into route (6.45%). With -ljemalloc, allocation drops to 3.84%.

Under -O3: 391.5B instructions, 103.2B Loads, 54.4B Stores, 82.1B branches, 1246.0M mispredictions, MPKI = 1246.0M/391.5B*1000=3.18, still high.

Summary¶

Results:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (M)	MPKI
1. o3	GCC 14 `-O3`	16	211.1	69.9	31.7	43.2	175.5	0.83
1. o3	GCC 14 `-O3 -ljemalloc`	15	189.5	65.0	28.0	37.0	204.8	1.08
1. o3	GCC 14 `-O3 -flto`	15	193.8	65.0	27.4	39.6	163.5	0.84
2. timing	GCC 14 `-O3`	21	333.9	113.9	57.8	69.8	202.9	0.61
2. timing	GCC 14 `-O3 -ljemalloc`	19	301.8	106.9	51.8	60.5	202.9	0.67
2. timing	GCC 14 `-O3 -flto`	21	324.4	111.6	56.2	67.0	194.7	0.60
3. traffic_21	GCC 14 `-O3`	21	226.4	65.5	31.3	50.8	749.3	3.31
3. traffic_21	GCC 14 `-O3 -ljemalloc`	18	198.0	59.2	26.1	42.7	723.3	3.65
3. traffic_21	GCC 14 `-O3 -flto`	20	216.1	62.8	29.2	48.1	745.4	3.45
4. traffic_74_ruby	GCC 14 `-O3`	31	391.5	103.2	54.4	82.1	1246.0	3.18
4. traffic_74_ruby	GCC 14 `-O3 -ljemalloc`	28	363.6	97.1	49.5	74.1	1200.3	3.30
4. traffic_74_ruby	GCC 14 `-O3 -flto`	29	361.3	96.7	48.6	75.5	1204.0	3.33

735.gem5_r's four tests exercise very different code paths. Due to gem5's high modularity, -flto helps inline functions that could benefit from it. Additionally, gem5 heavily allocates dynamic objects (e.g., Packets), making -ljemalloc effective. -march=native has limited applicability.

Overall: 1164B instructions, 246B branches, MPKI = 2.05, not high, mainly contributed by the two traffic workloads.

750.sealcrypto_r¶

sealcrypto performs homomorphic encryption, with one workload:

sealcrypto_r refrate ecuador_province_capitals_refrate.csv Galapagos

Runtime 108s, reftime 536s, 5.0 points.

Oddly, -O3 -flto regresses; -O3 -flto -ljemalloc has no effect; -O3 -march=native -flto -ljemalloc regresses further. But LLVM 22 dominates with nearly 2x performance, only 50.5s, 10.6 points. It's essentially 750.sealcrypto_r alone that lets LLVM 22 surpass GCC 14 overall on SPEC INT 2026. Let's see why.

First, GCC 14 -O3 hotspot analysis:

seal::util::DWTHandler::transform_to_rev(...) from src/seal/util/dwthandler.h: 25.65%, DWT (Discrete Wavelet Transform), instruction-level: lots of imul/add/shr/shl;
seal::util::DWTHandler::transform_from_rev(...) from src/seal/util/DWTHandler.h: 16.58%, inverse DWT, same computation pattern;
seal::util::multiply_uint64_generic(T operand1, S operand2, unsigned long long *result128) from src/seal/util/uintarith.h: 11.60%, 64-bit * 64-bit = 128-bit multiplication via arithmetic and bit operations;
seal::util::dot_product_mod(...) from src/seal/util/uintarithsmallmod.cpp: 11.48%, dot product with modular reduction using multiply_accumulate_uint64 and barrett_reduce_128;
seal::util::dyadic_product_coeffmod(...) from src/seal/util/polyarithsmallmod.cpp: 9.08%, element-wise modular multiplication;
seal::util::BaseConverter::fast_convert_array(...) from src/seal/util/rns.cpp: 5.88%, RNS (Residue Number System) conversion;
seal::util::RNSTool::sm_mrq(...) from src/seal/util/rns.cpp: 5.40%.

Being cryptography, there's massive integer computation with multiplication and bit operations in prime fields. 3113.4B instructions, 385.7B Loads, 161.3B Stores, 78.5B branches, 450.0M mispredictions, MPKI = 450.0M/3113.4B*1000=0.14, the lowest overall, even below 714.cpython_r. IPC is the highest at 5.09. Top-down: 80.7% Retiring, 13.5% Backend Bound, meaning the processor is running at nearly full throughput.

With -O3 -march=native, AVX2 instructions appear, but the sequences are complex with heavy data shuffling (vpunpcklqdq/vpunpckhqdq/vpermq/vpblendvb/vperm2i128), see Godbolt. Instructions drop to 2757.7B but IPC drops more, resulting in regression from 108s to 116s. The original -O3 version processes one element at a time but with higher ILP, compensating via IPC. GCC 16's -march=native is much better, with fewer shuffles, mostly vpaddq/vpsubq/vpmuludq/vpsllq/vpsrlq compute instructions, see Godbolt.

What did LLVM 22 do? Instructions plummet to 1213.6B (302.8B Loads, 109.2B Stores, 57.2B branches, 1093.9M mispredictions, MPKI=0.90). Taking DWTHandler::transform_to_rev as example: seal implements 6464=128 multiplication generically in multiply_uint64_generic and inlines it; GCC 14 faithfully implements the algorithm with many instructions (Godbolt); but AMD64's mul instruction already does 6464=128, so LLVM 22 recognizes the pattern and compiles to mul (Godbolt, with BMI2 even mulx). Such 64-bit multiply-high instructions exist across ISAs: ARM64's umulh, RISC-V's mulhu, LoongArch's mulh.du. Of course, seal's source already handles this with __int128 when supported. Similar to 706.stockfish_r's 1to6_classical. However, SPEC CPU 2026's compiler neutrality removes such compiler/ISA-dependent code, falling back to the most generic implementation. Only compiler recognition and optimization remains.

This somewhat fails to reflect real-world optimization, since many applications have co-evolved with ISA extensions/compiler extensions, even writing intrinsics (e.g., original stockfish has optimizations for AVX512/AVX2/SSSE3/NEON_DOTPROD/LASX/LSX). Compilers then implement passes to recognize generic fallback code and map back to efficient implementations. Similar to the well-known "compiler recognizes popcount loop and emits popcnt instruction" example; programs often use __builtin_popcount directly. C++20's std::popcount partially addresses this, but came too late.

In contrast, Geekbench is more open to ISA extension optimization (e.g., AMX/SME's dramatic score impact), though this earns it the "AppleBench" moniker.

Meanwhile, LLVM 22 generates significantly more mispredictions. Via perf record -e branch-misses:pp, 46.81% come from sm_mrq, specifically the inlined multiply_uint_mod from src/seal/util/uintarithsmallmod.h, which has a final step: if result >= p, subtract p: SEAL_COND_SELECT(tmp2 >= p, tmp2 - p, tmp2) (familiar from Montgomery Multiplication; Barrett Reduction here, same principle). The SEAL_COND_SELECT macro (with SEAL_AVOID_BRANCHING undefined, using the ternary operator):

#ifndef SEAL_AVOID_BRANCHING #define SEAL_COND_SELECT(cond, if_true, if_false) (cond ? if_true : if_false) #else #define SEAL_COND_SELECT(cond, if_true, if_false) \  ((if_false) ^ ((~static_cast<uint64_t>(cond) + 1) & ((if_true) ^ (if_false)))) #endif

LLVM 22 uses a branch:

# Initialize rax = 0 mov $0x0,%eax # Compare tmp2(rcx) with p(r10) cmp %r10,%rcx # If p > tmp2, jump to label: jb label # rax = r10, i.e., rax = p mov %r10,%rax label: # Compute tmp2 - rax sub %rax,%rcx

Less computation but high branch misprediction rate, unless hardware implements Short Forward Branch to Predication (see Brief Introduction to OoO CPUs (Part 3: Frontend)). GCC 14's approach:

# tmp2 in rax, p in rdx # rcx = rax, i.e., rcx = tmp2 mov %rax,%rcx # rcx -= rdx, i.e., rcx = tmp2 - p sub %rdx,%rcx # Compare tmp2 and p cmp %rdx,%rax # If tmp2 >= p, rax = rcx = tmp2 - p; otherwise rax keeps original tmp2 cmovae %rcx,%rax

GCC 14 avoids massive mispredictions via cmov. This difference alone creates LLVM 22's much higher MPKI. If LLVM 22 used cmov here, performance could improve further. LLVM 22 does use cmov in many places, but why it ultimately chose not to in this specific case requires further investigation.

LLVM 22 with -O3 -march=native improves mispredictions from 1093.9M to 612.7M (MPKI=0.54). The improvement isn't in sm_mrq (still uses branch, not cmov) but in DWTHandler::transform_from_rev and RNSTool::fastbconv_sk. These functions also have SEAL_COND_SELECT, but now cond ? if_true : if_false compiles to vpcmpgtq + vblendvpd, a vectorized cmov equivalent. LLVM 22 refuses cmov for scalar but implements it for vectorization.

750.sealcrypto_r under different compilers and flags:

Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (M)	MPKI
GCC 14 `-O3`	108	3113.4	385.7	161.3	78.5	450.0	0.14
GCC 14 `-O3 -march=native`	116	2757.7	370.0	126.7	76.1	431.0	0.16
GCC 15 `-O3`	106.4	3071.3	379.1	161.4	80.0	416.1	0.14
GCC 15 `-O3 -march=native`	117.7	2701.9	379.4	130.6	77.6	406.9	0.15
GCC 16 `-O3`	105.9	3020.1	381.1	158.5	80.7	430.3	0.14
GCC 16 `-O3 -march=native`	99.3	2492.3	328.0	123.2	81.8	433.3	0.17
LLVM 22 `-O3`	50.5	1213.6	302.8	109.2	57.2	1093.9	0.90
LLVM 22 `-O3 -march=native`	48.2	1126.0	299.2	108.7	53.4	612.7	0.54

753.ns3_r¶

753.ns3_r is similar to 710.omnetpp_r, also a network discrete event simulator. Workloads:

# 1. mobile ns3_r mobile-scenario --simTimeMinutes=3 --RngSeed=1 --RngRun=1 # 2. tcp ns3_r tcp-pacing --simulationEndTime=500 --useEcn=false --RngSeed=1 --RngRun=1 # 3. lena ns3_r lena-radio-link-failure --numberOfEnbs=2 --interSiteDistance=800 --simTime=200 --RngSeed=1 --RngRun=1 # 4. dctcp ns3_r dctcp-example --enableSwitchEcn=true --flowStartupWindow=0.4 --convergenceTime=0.4 --measurementWindow=0.4 --RngSeed=1 --RngRun=1 # 5. wifi_mixed ns3_r wifi-mixed-network --isUdp=0 --payloadSize=3072 --simulationTime=25 --RngSeed=1 --RngRun=1 # 6. wifi_eht ns3_r wifi-eht-network --simulationTime=0.2 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=547 --RngSeed=1 --RngRun=1

Runtimes: 18s, 15s, 3s, 19s, 23s, and 14s, total 92s, reftime 613s, 6.7 points. Optimization effects:

-O3 -flto: 16s, 14s, 3s, 17s, 19s, 13s, total 82s, 7.5 points (+12%);
-O3 -flto -ljemalloc: 14s, 12s, 3s, 13s, 18s, 11s, total 71s, 8.6 points (+15% over -flto).

Massive improvements; only -march=native has minimal impact (0.5%).

1. mobile¶

Hotspots:

cfree/malloc/_int_malloc/_int_free_chunk/operator new: 6.99%+5.66%+4.15%+1.83%+1.81%=20.44%, allocation-intensive;
ns3::LteMiErrorModel::GetTbDecodificationStats(...): 9.57%, floating-point accumulation and binary search;
ns3::LteMiErrorModel::Mib(...): 4.39%, floating-point computation;
ns3::LteMiErrorModel::MappingMiBler(...): 3.53%, floating-point, erf function calls, table lookups;
ns3::MapScheduler::Insert(const Event& ev): 2.66%, std::map red-black tree insertion.

Allocation-intensive. With -O3 -flto, Mib inlined into GetTbDecodificationStats. With -ljemalloc, allocation drops to 8.01%.

Unusually for SPEC INT 2026, mobile involves significant floating-point and libm calls (erf/atan2/pow/log), half-stepping into SPEC FP territory but pulled back by heavy libc calls.

Under -O3: 257.2B instructions, 66.6B Loads, 35.4B Stores, 54.4B branches, 631.1M mispredictions, MPKI = 631.1M/257.2B*1000=2.45. Mispredictions mainly from allocator and std::map insertion.

2. tcp¶

Hotspots:

cfree/malloc/_int_malloc/_int_free_chunk/operator new: 19.75%;
ns3::TcpTxBuffer::NextSeg(...): 4.35%, TCP stack implementing RFC 6675 SACK;
ns3::MapScheduler::Insert(...): 4.05%;
__do_dyncast/__dynamic_cast: 3.35%.

Under -O3: 204.8B instructions, 63.5B Loads, 41.4B Stores, 45.4B branches, 148.1M mispredictions, MPKI = 148.1M/204.8B*1000=0.72.

3. lena¶

Hotspots: allocation 20.64%, MapScheduler::Insert 2.41%, dynamic_cast 2.55%.

Under -O3: 46.6B instructions, 14.2B Loads, 9.6B Stores, 10.4B branches, 53.4M mispredictions, MPKI = 53.4M/46.6B*1000=1.15.

4. dctcp¶

Hotspots: allocation 40.61%, MapScheduler::Insert 6.94%.

Under -O3: 225.3B instructions, 71.1B Loads, 43.9B Stores, 52.3B branches, 295.8M mispredictions, MPKI = 295.8M/225.3B*1000=1.31.

5. wifi_mixed¶

Same pattern: allocation + TcpTxBuffer::NextSeg. Under -O3: 291.8B instructions, 88.8B Loads, 52.7B Stores, 66.5B branches, 201.9M mispredictions, MPKI = 201.9M/291.8B*1000=0.69.

6. wifi_eht¶

Hotspots include InterferenceHelper::AppendEvent and WifiSpectrumValueHelper::GetBandPowerW. Under -O3: 194.3B instructions, 58.1B Loads, 32.6B Stores, 44.0B branches, 372.0M mispredictions, MPKI = 372.0M/194.3B*1000=1.91. Mispredictions mainly from std::map queries inlined in InterferenceHelper::AppendEvent.

Summary¶

Results:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (M)	MPKI
1. mobile	GCC 14 `-O3`	18	257.2	66.6	35.4	54.4	631.1	2.45
2. tcp	GCC 14 `-O3`	15	204.8	63.5	41.4	45.4	148.1	0.72
3. lena	GCC 14 `-O3`	3	46.6	14.2	9.6	10.4	53.4	1.15
4. dctcp	GCC 14 `-O3`	19	225.3	71.1	43.9	52.3	295.8	1.31
5. wifi_mixed	GCC 14 `-O3`	23	291.8	88.8	52.7	66.5	201.9	0.69
6. wifi_eht	GCC 14 `-O3`	14	194.3	58.1	32.6	44.0	372.0	1.91

Similar to 727.cppcheck_r, 753.ns3_r is essentially a memory allocator benchmark, with much time in malloc/free, plus std::map and libm calls. Under -O3: 1221B instructions, 273B branches, MPKI = 1.39.

777.zstd_r¶

The sole compression algorithm in SPEC INT 2026, replacing SPEC INT 2017's 557.xz_r, reflecting compression algorithm evolution. Eight workloads compressing the same file with different compression levels:

# 1. b3 zstd -b3 -e3 --verbose -i40 cld.tar # 2. b5 zstd -b5 -e5 --verbose -i25 cld.tar # 3. b7 zstd -b7 -e7 --verbose -i12 cld.tar # 4. b10 zstd -b10 -e10 --verbose -i6 cld.tar # 5. b14 zstd -b14 -e14 --verbose -i4 cld.tar # 6. b16 zstd -b16 -e16 --verbose -i1 cld.tar # 7. b18 zstd -b18 -e18 --verbose -i1 cld.tar # 8. b19 zstd -b19 -e19 --verbose -i1 cld.tar

Here -b is compression level lower bound, -e is upper bound (both equal = test one level). Runtimes: 11.0s, 14.5s, 13.0s, 11.6s, 24.5s, 10.9s, 20.1s, 25.5s, total 131.2s, reftime 644s, 4.9 points.

-O3 -flto or -O3 -ljemalloc have no improvement, but -O3 -march=native gives a nice 6% boost (total 124.0s, 5.2 points).

Taking b3 as example, hotspots:

ZSTD_compressBlock_doubleFast_noDict_generic from src/zstd-1.5.6/lib/compress/zstd_double_fast.c: 56.82%, hashing data and finding matches for compression;
ZSTD_decompressBlock_internal.part.0 from src/zstd-1.5.6/lib/decompress/zstd_decompress_block.c: 16.63%, decompression logic;
ZSTD_encodeSequences from src/zstd-1.5.6/lib/compress/zstd_compress_sequences.c: 10.91%, bmi2 version disabled by SPEC, using generic version.

Under -O3, b3: 181.4B instructions, 49.9B Loads, 17.7B Stores, 19.1B branches, 543.9M mispredictions, MPKI = 543.9M/181.4B*1000=3.00. 78.98% mispredictions from ZSTD_compressBlock_doubleFast_noDict_generic (e.g., if (MEM_read64(matchl0) == MEM_read64(ip))).

b5 hotspots: ZSTD_RowFindBestMatch 67.91%, ZSTD_compressBlock_lazy_generic 9.12%. Under -O3: 273.6B instructions, MPKI = 2.06.

b14 hotspots: ZSTD_DUBT_findBestMatch 85.74%. Under -O3: 197.6B instructions, MPKI = 1609.6M/197.6B*1000=8.15, extremely high.

b16 hotspots: ZSTD_insertBtAndGetAllMatches 38.62%, ZSTD_insertBt1 35.15%. Under -O3: 129.1B instructions, MPKI = 5.05.

b7/b10 are similar to b5; b18/b19 are similar to b16. zstd uses different paths based on compression level, trading compression ratio for speed.

With -march=native: BMI instructions (bzhi, tzcnt) and three-operand non-flag-affecting operations (shrx) reduce instruction counts, similar to corresponding RISC-V instructions. Results before and after:

Workload	Compiler + Flags	Time (s)	Insns (B)	Load (B)	Store (B)	Branch (B)	Mispred (M)	MPKI
1. b3	GCC 14 `-O3`	11.0	181.4	49.9	17.7	19.1	543.9	3.00
1. b3	GCC 14 `-O3 -march=native`	10.5	170.4	49.9	18.3	18.9	543.8	3.19
2. b5	GCC 14 `-O3`	14.5	273.6	61.3	35.1	28.4	562.4	2.06
2. b5	GCC 14 `-O3 -march=native`	14.0	250.5	59.7	35.4	28.3	559.1	2.23
3. b7	GCC 14 `-O3`	13.0	228.5	48.9	25.8	29.8	599.3	2.62
3. b7	GCC 14 `-O3 -march=native`	12.7	207.4	46.6	26.0	29.8	596.7	2.88
4. b10	GCC 14 `-O3`	11.6	207.2	41.5	17.6	32.6	516.3	2.49
4. b10	GCC 14 `-O3 -march=native`	11.5	184.0	37.8	17.8	32.6	569.6	3.10
5. b14	GCC 14 `-O3`	24.5	197.6	48.8	16.5	29.1	1609.6	8.15
5. b14	GCC 14 `-O3 -march=native`	23.7	190.1	46.7	15.9	27.8	1612.5	8.48
6. b16	GCC 14 `-O3`	10.9	129.1	29.9	11.2	18.0	652.1	5.05
6. b16	GCC 14 `-O3 -march=native`	10.2	124.7	30.7	12.0	17.3	646.5	5.18
7. b18	GCC 14 `-O3`	20.1	265.8	57.0	17.0	32.6	987.7	3.72
7. b18	GCC 14 `-O3 -march=native`	18.4	259.2	57.0	17.2	31.4	980.7	3.78
8. b19	GCC 14 `-O3`	25.5	342.0	72.9	19.1	41.8	1060.6	3.10
8. b19	GCC 14 `-O3 -march=native`	23.4	332.8	72.7	19.1	40.1	1050.2	3.16

Overall under -O3: 1827B instructions, 232B branches, MPKI = 3.58, third-highest after 729.abc_r and 723.llvm_r.

Discussion¶

Compiler Flags Comparison¶

Compilation flags significantly impact SPEC INT 2026 Rate performance:

-flto helps 707.ntest_r, 710.omnetpp_r, 714.cpython_r, 734.vpr_r, 735.gem5_r, 753.ns3_r. When hotspots are spread across many small functions, LTO essentially recovers performance lost to file-splitting for readability;
-ljemalloc helps 710.omnetpp_r, 721.gcc_r, 723.llvm_r, 727.cppcheck_r, 734.vpr_r, 735.gem5_r, 753.ns3_r. These programs do too much dynamic allocation, some benchmarks are essentially allocator benchmarks, where replacing glibc with jemalloc/mimalloc provides nice improvement (latest glibc is also improving malloc, unclear how much);
-march=native helps 706.stockfish_r, 707.ntest_r, 735.gem5_r, 777.zstd_r. Partially SIMD (for ARM64, e.g., Apple M2, it's the USDOT instruction giving 706.stockfish_r +33%; without i8mm extension, -march=native has no effect), partially bit manipulation instructions (popcnt, BMI). Many real-world programs already account for hardware acceleration, often using intrinsics directly, but SPEC disables these, falling back to generic versions that depend heavily on -march=native and compiler pattern recognition.

Other common flags like -static, -fomit-frame-pointer, -Ofast, -ffast-math haven't been extensively tested yet.

Compiler Version Comparison¶

The primary compiler is GCC 14.2.0 (Debian Trixie's version). Interestingly, even in 2026, with hardware unchanged, software performance continues growing with compiler updates. GCC 15 generates faster SSE/AVX sequences for 706.stockfish_r; LLVM 22 recognizes 750.sealcrypto_r's 64-bit multiplication pattern. Additionally, LLVM defaults to inlining popcount's optimized implementation while GCC calls libgcc's popcount; the former bloats code, the latter adds call overhead. These specific optimizations can be cross-ported. In SPEC INT 2017 era, GCC dominated LLVM; now LLVM gains ground via 750.sealcrypto_r, then gets overtaken again by GCC 15/16. As SPEC CPU 2026 research deepens, faster programs will be compiled.

Branch Prediction¶

SPEC INT 2026 Rate benchmarks with high MPKI:

723.llvm_r MPKI=5.98
729.abc_r MPKI=3.87
777.zstd_r MPKI=3.58
721.gcc_r MPKI=3.37
734.vpr_r MPKI=2.52
707.ntest_r MPKI=2.27
735.gem5_r MPKI=2.05

For comparison, SPEC INT 2017 Rate:

505.mcf_r MPKI=14.39
541.leela_r MPKI=12.62
557.xz_r MPKI=5.29
531.deepsjeng_r MPKI=4.40
520.omnetpp_r MPKI=4.33
502.gcc_r MPKI=3.13

SPEC INT 2026 Rate is significantly lower overall. Of course, these are per-benchmark averages; individual workloads may be higher. But regardless, no more battling 505.mcf_r's spec_qsort and 541.leela_r's if(randint(2) == 0). That said, SPEC INT 2026 Rate still has many MPKI contributions from std::map red-black trees and other data structures with data-dependent branches, not necessarily easy to optimize in hardware. Applications are becoming aware of branch prediction, using ternary operators to hint compilers to generate cmov instructions.

Limitations¶

Current testing is limited to Intel i9-14900K P-Core; similar analysis is needed on ARM64/RISC-V/LoongArch. Different ISAs likely lead to different conclusions. Additionally, analysis focuses on perf-reported hotspot functions; finer-grained analysis (instruction type distributions, POPCNT/BMI/AVX usage) would be valuable.

Only Rate 1 (single copy) was tested. Multi-copy runs would stress memory bandwidth and cache contention more, potentially changing MPKI, IPC, etc. significantly. Analysis focuses on instruction-level and branch prediction, lacking microarchitecture-level deep analysis (L1/L2/LLC miss rates, TLB misses) more directly useful for processor designers. Power data wasn't considered; energy efficiency ratio needs RAPL measurement. Finally, PGO (-fprofile-generate / -fprofile-use) wasn't attempted and could potentially bring nice improvements.

Conclusion¶

This article provides in-depth analysis of SPEC CPU 2026 INT Rate workloads, for reference by compiler and processor designers. From the compiler perspective, combining the best of GCC and LLVM can further improve performance; from the processor perspective, optimizing for program bottlenecks can further improve scores.

SPEC CPU 2026 负载特性分析（INT Rate 篇）

Fri, 22 May 2026 00:00:00 +0000

SPEC CPU 2026 负载特性分析（INT Rate 篇）¶

本文同步发布到本人的知乎。

English version

背景¶

最近用 SPEC CPU 2026 跑了一些基准测试，打算结合测试结果做一些深入的负载特性分析。本篇主要是分析 SPEC INT 2026 Rate 的负载特性，SPEC FP 2026 Rate 的分析请看 FP Rate 篇。

本文测试环境：CPU 为 Intel i9-14900K P-Core @ 5.7 GHz，Linux 发行版为 Debian Trixie，编译器是 GCC 14.2.0，默认编译选项是 -O3。其实这颗 CPU 最快能 Boost 到 6.0 GHz，但时不时因为未知原因（防缩缸？）在单核负载下也 Boost 不上去，具体表现为每跑一段时间后 CPU 核心就会强制降频到 4.7 GHz。故退而求其次，选择在更容易稳定达到的 5.7 GHz 频率来跑。能稳定跑到 6.0 GHz 的只有那一个物理 P 核，其他 P 核也都能上 5.7 GHz，降频了换一个核心即可。6.0 GHz 下的性能可以参考之前的测试结果：INT 和 FP，基本上，从 5.7 GHz 到 6.0 GHz，性能可以按频率线性放缩。本文可能针对同一个负载给出多个不同的运行时间，这可能是因为多次运行导致的性能波动，也可能是因为部分数字包含了 perf record 的开销，不过误差都很小，可以放心对比。本文所用的脚本已开源到 jiegec/spec2026。

SPEC INT 2026 Rate 分析¶

706.stockfish_r¶

stockfish 是一个著名的国际象棋引擎，该基准测试包括如下三个负载：

# 1. 1to6_classical stockfish bench 1600 1 26 spec_ref_pos_1to6.fen depth classical # 2. 1to6_nnue stockfish bench 1600 1 26 spec_ref_pos_1to6.fen depth nnue # 3. 7to11_nnue stockfish bench 1600 1 26 spec_ref_pos_7to11.fen depth nnue

实测数据显示，三个负载耗费的时间分别是 47s、77s 和 72s，共计 196s。reftime 是 1260s，对应 6.4 分。开启 -march=native 后，1to6_classical 时间缩短 10% 到 43s，而 1to6_nnue 和 7to11_nnue 时间明显缩短到 32s 和 31s，总时间 105s，对应 12 分，分数提升显著。下面逐一分析这三个负载的性能特性。

1. 1to6_classical¶

通过 perf 观察性能瓶颈，以下列出 1to6_classical 的主要热点函数及其时间占比（后续各基准测试均采用相同表示方法）：

Stockfish::Eval::evaluate(const Position& pos) 来自 src/evaluate.cpp: 19.16%，inline 了 Evaluation<NO_TRACE>(pos).value() 的调用，里面主要是对局面的评估，涉及比较多零散的访存和计算，没有特别集中的热点指令；
Stockfish::TranspositionTable::probe(const Key key, bool& found) 来自 src/tt.cpp: 17.91%，主要的瓶颈来自于随机访存，在 first_entry(key) 当中有 &table[mul_hi64(key, clusterCount)].entry[0] 的代码，其中 mul_hi64 计算两个 64 位整数乘法结果的高 64 位，因此访存地址是根据参数计算得出；对于 mul_hi64，GCC 14 会忠实地按照源码把 64 位拆分成高低 32 位分别计算，而 LLVM 22 能够正确识别出这段代码的意图，并直接用 AMD64 的 mul 指令实现，这个功能在 PR #168396 中实现，mul_hi64 对应 PR 描述中的 Ladder；事实上，Stockfish 原本的代码里会用 __int128，此时 GCC 14 也能生成高效的代码，只可惜因为用到了 C 语法扩展，被 SPEC 禁用了（汇编对比见 Godbolt）；
Stockfish::MovePicker::next_move(bool skipQuiets) 来自 src/movepick.cpp: 10.36%，里面比较慢的是 partial_insertion_sort，找到插入位置后，还要把原来数组里靠后的元素往后挪，留出空间用于插入元素；
Stockfish::search(Position& pos, Stack* ss, Value alpha, Value beta, Depth depth, bool cutNode) 来自 src/search.cpp: 9.49%，搜索逻辑主要在这里实现；
__popcountdi2 来自 libgcc: 7.52%，被 Stockfish::Eval::evaluate(const Position& pos) 调用，用来判断局面上满足某种条件，内部实现就是位运算，有兴趣的读者可以阅读 Hacker's Delight 这本书。

开了 -march=native 后，能观察到 __popcountdi2 被内联为 popcnt 指令。经过测试，开 -mpopcnt 后时间即从 47s 降低到 44s，接近 -march=native 的性能。可见仅开启 popcnt 指令集并消除 __popcountdi2 的函数调用开销，就能带来明显的性能提升。

-O3 编译选项下，1to6_classical 执行的指令数为 531.8B（instructions 性能计数器），其中 Load 指令有 135.7B 条（mem_inst_retired.all_loads 性能计数器），Store 有 59.7B 条（mem_inst_retired.all_stores 性能计数器），分支指令有 56.0B 条（branch-instructions 性能计数器），其中有 2622.8M 次错误预测（branch-misses 性能计数器）。可见，1to6_classical 的 MPKI 还是比较高的：2622.8M/531.8B*1000=4.93。即使是在 SPEC INT 2017 当中，这一数值也高于 531.deepsjeng_r 的 3.16 和 557.xz_r 的 3.49，低于 505.mcf_r 的 6.24 和 541.leela_r 的 7.71。

使用 perf record -e branch-misses:pp，观察到主要的分支错误预测来自于 Stockfish::MovePicker::next_move() 函数，贡献了 27.48% 的错误预测，主要是插入排序的部分，一是循环找到插入的位置，二是循环搬运数组内原有元素。其次是 Stockfish::Eval::evaluate() 函数，贡献了 17.42% 的错误预测。再其次是 Stockfish::search() 函数，贡献了 13.06% 的错误预测。

开 -O3 -mpopcnt 后，指令数减少到 453.9B，其中 Load 有 124.2B 条，Store 有 53.1B 条，分支指令有 46.1B 条，错误预测还是 2.6B 次，光是内联 __popcountdi2 的调用，便可减少 77.9B 条指令，约占原来的 15%。__popcountdi2 本身的实现包括 21 条指令，此外还有 __popcountdi2@plt 里的一次 jmp，和 call __popcountdi2@plt 本身和前后保存和恢复寄存器的开销。

2. 1to6_nnue¶

后两个负载的引擎从 classical 变为了 nnue，涉及神经网络，因此它的计算模式会不太一样。通过 perf 观察到 1to6_nnue 的主要耗时函数：

Stockfish::Eval::NNUE:evaluate(const Position& pos, bool adjusted) 来自 src/nnue/evaluate_nnue.cpp：80.59%，主要耗时在 affine_transform_non_ssse3 的 sum += weights[offset + j] * input[j]，即神经网络的推理过程，它的计算过程是，进行 int8_t 乘 uint8_t，再累加到 int32_t 类型的结果，默认编译选项下，只能用基础的 SSE 指令如 pmaddwd/paddd，而不能用 AVX；
Stockfish::TranspositionTable::probe(const Key key, bool& found) 来自 src/tt.cpp: 仅 4.81%，瓶颈和前面分析的一样是随机访存。

分析 Stockfish::Eval::NNUE:evaluate 的指令，可以看到，它为了实现上述逻辑，核心思路是采用 pmaddwd 指令，进行 4 次 16 位有符号的乘法计算，累加到 32 位的结果。但是，在这之前，需要先把输入的 8 位有符号 weights 和无符号 input 转换到 16 位有符号数。其中 8 位有符号 weights 转换比较简单，而 8 位无符号 input 的处理逻辑比较复杂。首先，它对 input 的每个元素加上 128，然后当成有符号数来看待，这相当于对每个元素减去了 128，把 uint8_t 映射到了 int8_t。这样，input 就可以用和 weights 相同的方法进行符号扩展。但是，这样会导致结果计算错误，为了纠正这个偏差，又减去了 128 倍的 weights 之和。汇编代码如下（Godbolt）：

1: # 加载有符号 weights 的 16 个元素 movdqu (%rdx,%rcx,1),%xmm2 movdqa %xmm5,%xmm8 # 加载无符号 input 的 16 个元素 movdqa (%r12,%rcx,1),%xmm10 add $0x10,%rcx # 对 weights 进行符号扩展 pcmpgtb %xmm2,%xmm8 movdqa %xmm2,%xmm9 # 每个 input 元素加上 128，即减去 128 转为有符号 int8_t paddb %xmm6, %xmm10 # 符号扩展 weights punpckhbw %xmm8,%xmm2 punpcklbw %xmm8,%xmm9 movdqa %xmm2,%xmm11 movdqa %xmm9,%xmm8 # 计算 weights 之和乘以 128 pmaddwd %xmm3,%xmm11 pmaddwd %xmm7,%xmm8 paddd %xmm11,%xmm0 paddd %xmm8,%xmm0 paddd %xmm11,%xmm0 movdqa %xmm5,%xmm11 # 对 input 进行符号扩展 pcmpgtb %xmm10,%xmm11 paddd %xmm8,%xmm0 movdqa %xmm10,%xmm8 punpckhbw %xmm11,%xmm10 punpcklbw %xmm11,%xmm8 # 计算 weights * input pmaddwd %xmm10,%xmm2 pmaddwd %xmm8,%xmm9 # 结果累加 paddd %xmm2,%xmm0 paddd %xmm9,%xmm0 cmp $0x400,%rcx jne 1b

对于这种适合 SIMD 的代码，开启 -march=native 后通常会有明显的性能提升，实际测试也证明了这一点，开了 -march=native 后，时间从 77s 降低到 32s，Stockfish::Eval::NNUE::evaluate 时间占比降到 54.20%，此时主要的计算指令变为 AVX-VNNI 扩展的 vpdpbusd (Multiply and Add Unsigned and Signed Bytes) 指令，即针对字节（weights 数组元素是 int8_t 类型，input 数组元素是 uint8_t 类型）元素的整数乘加融合指令，和的类型是 int32_t。核心循环如下（Godbolt）：

1: # 加载无符号 input vmovdpa (%r8,%rcx,1),%ymm0 # 加载有符号 weights 并计算 sum += weights[offset + j] * input[j] {vex} vpdpbusd (%rdx,%rcx,1),%ymm0,%ymm2 add $0x20,%rcx cmp $0x400,%rcx jne 1b

如果 CPU 支持 AVX512-VNNI，还能进一步扩展到 512 的位宽：vpdpbusd (%rdx,%rax), %zmm1, %zmm0。需要注意的是，单纯开 -mavx2 仅能把时间从 77s 减少到 50s，距离 -march=native 的 32s 还有明显的差距：即使开启了 AVX（Godbolt），由于没有开 AVX-VNNI，不能用 vpdpbusd 指令，还是需要先格式转换到 16 位，再用 32 位累加器的 16 位整数乘加指令。Stockfish 的 NNUE 这样的计算方式，就是奔着 vpdpbusd 这条指令去的。因此缺乏这类指令的 CPU，或者虽有指令但编译器未加利用，性能就会明显落后。

例如在 ARM64 下，对应的 USDOT (Dot product with unsigned and signed integers (vector)) 指令被包括在 i8mm 扩展当中，有这个扩展的话，-march=native 性能提升显著（Godbolt），例如 Apple M2；而如果没有这个扩展，开不开 -march=native 就没什么区别，例如 Apple M1，此时就要回退到类似 AMD64 那样，先扩展到 16 位，再求和（Godbolt）。RISC-V Vector 指令集扩展则有 vwmulsu.vv 指令可以使用，得到 16 位乘法结果之后，再用 vwadd.wv 指令累加到 32 位（Godbolt）。LoongArch 也有对应的 xvmulwev.h.b/xvmulwod.h.b 指令，得到 16 位乘法结果之后，用 xvhaddw.w.h 指令累加到 32 位（Godbolt），还可以进一步优化为用 xvmulwev.h.bu.b 指令，优化后的 transform 函数性能相比 GCC 16 快 37%。

除了是否开启对应指令集扩展以外，还观察到 GCC 15 在 1to6_nnue 上相比 GCC 14 有明显的性能提升（编译选项为 -O3），时间从 77s 降低到了 49s。观察生成的指令，虽然仍使用 SSE 指令，但指令序列更简洁（Godbolt）：

# %xmm5 初始化为全零 1: # 加载有符号 weights 的 16 个元素 movdqu (%rdx,%rcx,1),%xmm4 movdqa %xmm5,%xmm8 # 加载无符号 input 的 16 个元素 movdqa (%r12,%rcx,1),%xmm2 add $0x10,%rcx # 将 weights 和零比较，非负得 0，负数得 0xFF pcmpgtb %xmm4,%xmm8 movdqa %xmm2,%xmm6 movdqa %xmm4,%xmm7 # 把 input 从 8 位无符号扩展到 16 位，保存到 %xmm2 和 %xmm6 punpckhbw %xmm5,%xmm2 punpcklbw %xmm5,%xmm6 # 结合前面的 pcmpgtb，把 weights 从 8 位有符号扩展到 16 位，保存到 %xmm4 和 %xmm7 punpckhbw %xmm8,%xmm4 punpcklbw %xmm8,%xmm7 # 每条 pmaddwd 指令进行 4 次 16-bit * 16-bit + 16-bit * 16-bit = 32-bit 的计算 # 两条 pmaddwd 共完成 8 次 16-bit 乘法和 8 次 32-bit 加法 pmaddwd %xmm4,%xmm2 pmaddwd %xmm7,%xmm6 # 每条 paddd 指令进行 4 次 32 bit 的累加 paddd %xmm2,%xmm0 paddd %xmm6,%xmm0 cmp $0x400,%rcx jne 1b

可见，即使没有专用的 vpdpbusd 指令，仅用 SSE 也仍有优化空间。GCC 15 通过 SSE 高效实现了有符号和无符号数的符号扩展，获得了介于 GCC 14 次优指令序列与专用 vpdpbusd 指令之间的性能。这在 SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison 论文中也有提及：For example, gcc-15 reduces the instruction count of 706.stockfish_r by up to 3x，不过这个数字是相比 GCC 13 的；相比 GCC 14 也有减少，不过没有那么明显，详情见论文中的 Figure 10 和 Figure 16，这里实测下来是从 GCC 14 的 1342B 条指令降低到 GCC 15 的 1015B。相比之下，LLVM 22 生成的 SSE（-O3，Godbolt）或 AVX（-O3 -march=alderlake，Godbolt）指令都没有 GCC 15 高效。

-O3 编译选项下，1to6_nnue 执行的指令数为 1342.1B，其中 Load 指令有 182.2B 条，Store 指令有 61.8B 条，128 位整数向量指令（如 SSE）有 229.1B 条（int_vec_retired.128bit 性能计数器），分支指令有 77.6B 条，其中有 1612.9M 次错误预测。它的 MPKI 只有 1612.9M/1342.1B*1000=1.20，主要瓶颈还是在上述的神经网络推理当中。

GCC 15 用 -O3 编译选项下，1to6_nnue 执行的指令数减少到 1015.3B，其中 Load 指令有 175.0B 条，Store 指令有 57.8B 条，128 位整数向量指令只有 97.0B 条，分支指令有 77.4B 条，优化效果明显。

GCC 14 用 -march=native 编译选项下，1to6_nnue 执行的指令数锐减到 446.8B，只剩下三分之一的指令数了，其中 Load 指令有 119.6B 条，Store 指令有 44.4B 条，分支指令有 48.7B 条，256 位的 AVX VNNI 指令有 13.2B 条（int_vec_retired.vnni_256 性能计数器），优化效果明显。

3. 7to11_nnue¶

7to11_nnue 的行为与 1to6_nnue 类似，瓶颈也是在 Stockfish::Eval::NNUE:evaluate 函数上。开启 -march=native 后，时间从 72s 降到了 31s。GCC 15 的性能提升也和 1to6_nnue 类似，从 72s 降低到 46s。

-O3 编译选项下，7to11_nnue 执行的指令数为 1253.2B，其中 Load 指令有 176.1B 条，Store 指令有 61.6B 条，128 位整数向量指令有 212.5B 条，分支指令有 75.4B 条，其中有 1547.5M 次错误预测。它的 MPKI 只有 1547.5M/1253.2B*1000=1.23，主要瓶颈还是在神经网络推理当中。

GCC 15 用 -O3 编译选项下，7to11_nnue 执行的指令数减少到 955.3B，其中 Load 指令有 169.4B 条，Store 指令有 57.8B 条，128 位整数向量指令只有 92.3B 条，分支指令有 75.2B 条，优化效果明显。

GCC 14 用 -march=native 编译选项下，7to11_nnue 执行的指令数锐减到 425.9B，只剩下三分之一的指令数了，其中 Load 指令有 115.1B 条，Store 指令有 43.7B 条，分支指令有 47.1B 条，256 位的 AVX VNNI 指令有 12.0B 条，优化效果明显。

小结¶

各负载在不同编译选项下的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测次数 (M)	MPKI	128 位整数向量 (B)	256 位整数向量 (B)
1. 1to6_classical	GCC 14 `-O3`	47	531.8	135.7	59.7	56.0	2622.8	4.93	0.13	0.00
1. 1to6_classical	GCC 14 `-O3 -mpopcnt`	44	453.9	124.2	53.1	46.1	2639.3	5.81	0.13	0.00
2. 1to6_nnue	GCC 14 `-O3`	77	1342.1	182.2	61.8	77.6	1612.9	1.20	229.1	0.00
2. 1to6_nnue	GCC 15 `-O3`	49	1015.3	175.0	57.8	77.4	1258.2	1.24	97.0	0.00
2. 1to6_nnue	GCC 14 `-march=native`	32	446.8	119.6	44.4	48.7	953.8	2.13	5.1	36.3
3. 7to11_nnue	GCC 14 `-O3`	72	1253.2	176.1	61.6	75.4	1547.5	1.23	212.5	0.00
3. 7to11_nnue	GCC 15 `-O3`	46	955.3	169.4	57.8	75.2	1224.7	1.28	92.3	0.00
3. 7to11_nnue	GCC 14 `-march=native`	31	425.9	115.1	43.7	47.1	922.9	2.17	4.6	35.0

1to6_classical 类似传统的棋类引擎，有比较复杂的分支和访存，所以它的 MPKI=4.93 比较类似 SPEC CPU 2017 的 531.deepsjeng_r（MPKI=3.16），属于比较高的一类。而 1to6_nnue 和 7to11_nnue 的主要瓶颈在于 i8 的矩阵运算，能否用上硬件的加速指令（这里是 AVX-VNNI）对性能影响很大，分支预测瓶颈就明显小了。整体平均下来的 MPKI 是 1.85，并不算高。

707.ntest_r¶

ntest 是黑白棋的引擎，该基准测试包括如下负载：

ntest_r Othello.154.ggf 20 16

实测数据显示，运行这个负载耗费的时间是 140s。reftime 是 592s，对应 4.2 分。开启各项优化编译选项，-O3 -flto 相比 -O3 能带来 4% 的性能提升，进一步 -O3 -flto -march=native 相比 -O3 -flto 还能带来 10% 的性能提升。下面分析它的具体负载特性。黑白棋的规则很简单：只有在某个空位落子能翻转至少一个对方棋子时，才能下子，否则就要轮空。翻转的规则是，沿横、竖、斜八个方向检查，如果该方向上从新落子到另一颗己方棋子之间全是对方棋子，则这些对方棋子全部翻转。通过 perf 观察性能瓶颈，这几个函数耗费的时间占比较多：

flips(int sq, u64 mover, u64 enemy) 来自 src/flips.cpp：34.80%，最主要的开销，根据棋盘状态，经过一系列的访存和位运算，先通过 neighbors[sq]&enemy 判断是否有敌方邻居棋子（无则无法下子），再计算下子后会翻转哪些棋子，主要是一些数据依赖的访存，混合了一堆位运算；
solveNParity(int alpha, int beta, u64 mover, u64 enemy, u64 parity, EndgameSearch* search, bool hasPassed) 来自 src/solve.cpp：14.21%，进行 alpha-beta 减枝的 minimax 算法（negamax 变种），遍历棋盘上的空位置，首先找到那些满足 good parity 的位置（用 bitSet() 函数，汇编上是用 AMD64 的 bt 指令判断，因为黑白棋里，双方轮流下子，走最后一步的玩家获得一定的优势，所以先找那些能让自己走最后一步的位置），调用上述 flips() 看看是否会出现翻转，如果会出现翻转就尝试下子并进行递归，之后再遍历一次，这次遍历 bad parity 的位置，流程相同，主要的瓶颈在访存以及依赖访存结果的分支；
__popcountdi2：9.65%，因为没开 -mpopcnt/-march=native，故需要它来代替 popcnt 指令，用来计算场面上各颜色棋子的数量等等；
solveNFlipParity：8.95%，与 solveNParity 配合完成 minimax 算法；
solve2：5.38%，minimax 算法的一部分，处理棋盘只有两个空位的最终局面，此时判断最终胜败是比较容易的，不需要再递归。

这也是典型的棋类引擎模式，整个 minimax 算法占了 70%+ 的时间，为了搜索局面，有大量的位运算和访存，还有根据访存结果决定方向的分支。果不其然，执行 2688.3B 条指令，其中有 647.8B 条 Load 指令，255.2B 条 Store 指令，228.2B 条是分支指令，有 6.1B 次错误预测，MPKI 达到了 6.1B/2688B*1000=2.27。通过 perf record -e branch-misses:pp，看到 solveNParity 和 solveNFlipParity 一起贡献了 60.37% 的错误预测，主要就是上面说的，循环内对 good 还是 bad parity 的判断，以及链表插入时是否为 NULL 的判断，都是方向依赖数据的分支。

和 706.stockfish_r 类似，它也有不少的 popcnt 调用，那么打开 -mpopcnt 就会得到不错的性能提升：时间从 140s 降低到 126s，减少 11% 时间，指令数减少到 2286.9B，其中有 586.9B 条 Load 指令，206.7B 条 Store 指令，187.6B 条分支指令。而即使开 -march=native，性能也只是进一步降到 122s，只有少量的地方用到了 AVX2。

另一方面，LLVM 22 的性能在 707.ntest_r 上比 GCC 14 要快：同样是 -O3 的编译选项，运行时间从 GCC 14 的 140s 降低到 126s。深入研究汇编发现，LLVM 22 在没有开 -mpopcnt 的时候，它的行为是，直接把类似 libgcc 的 __popcountdi2 的代码内联到了程序当中，省去了 call libgcc 的开销，不过代价就是代码体积会增加，实际执行了 2416.9B 条指令，其中有 542.7B 条 Load 指令，202.9B 条 Store 指令，168.2B 条分支指令。类似地，706.stockfish_r 的 1to6_classical 也是 LLVM 22 比 GCC 14 快，从 47s 降低到 44s。

同时，GCC 15 相比 GCC 14 也有性能提升，运行时间从 140s 降低到了 130s。分析汇编，发现主要优化点在 flips(int sq, u64 mover, u64 enemy) 函数当中。性能区别有两点：

首先是对 callee-saved 寄存器的使用，GCC 14 会在 epilogue/prologue 直接进行一系列的 push/pop，而 GCC 15 更加聪明，仅在 if (neighbors[sq]&enemy) 条件成立的情况下，需要执行复杂函数体，需要 callee-saved 寄存器时才会进行 push/pop，否则就直接 ret，因为检查条件的时候并没有用到 callee-saved 寄存器，避免了保存和恢复。
自己编译的 GCC 15 默认是 -no-pie 模式，而发行版的 GCC 14 默认是 -pie，而 -no-pie 模式因为采用绝对地址，可以在 imul 等指令的操作数直接访问内存，节省寄存器，此时不再需要 callee-saved register，直接免去了 push/pop 的开销，开启 -static 也能带来类似的效果。上面的第一条分析是手动给 GCC 15 开 -pie 后观察到的。不过主要的性能提升还是来自于减少 push/pop 的执行次数。

GCC 15 编译的 707.ntest_r，实际执行 2429.3B 条指令，其中有 610.9B 的 Load 指令，206.2B 的 Store 指令，224.7B 的分支指令。707.ntest_r 在不同编译器和编译选项下的情况如下：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)
GCC 14 `-O3`	140	2688.3	647.8	255.2	228.2
GCC 14 `-O3 -flto`	134	2656.3	623.4	251.3	200.9
GCC 14 `-O3 -mpopcnt`	126	2286.9	586.9	206.7	187.6
GCC 14 `-O3 -march=native`	122	2230.0	588.2	206.4	185.2
LLVM 22 `-O3`	126	2416.9	542.7	202.9	168.2
GCC 15 `-O3`	130	2429.3	610.9	206.2	224.7

结合 706.stockfish_r 和 707.ntest_r 可以看到，popcnt 还是比较常用的。但可惜 AMD64 的基线并不提供这条指令，因此开了 x86-64-v2 或以上的编译优化选项后，这类应用便可以通过一条 popcnt 指令免去 libgcc 的 __popcountdi2 调用开销，节省因额外 call 及 PLT 带来的性能损失。相比 AVX-VNNI，popcnt 的普及程度就要大得多了。

708.sqlite_r¶

sqlite 就是大名鼎鼎的数据库了，不必多介绍。该基准测试包括三个负载：

# 1. main sqlite_r --memdb --size 2000 --testset main --verify # 2. cte sqlite_r --memdb --size 2000 --testset cte --verify # 3. fp sqlite_r --memdb --size 1000 --testset fp --verify

实测数据显示，三个负载耗费的时间分别是 69s、12s 和 25s，共计 106s。reftime 是 528s，对应 5.0 分。开启 -flto/-ljemalloc 对性能影响很小，-march=native 甚至带来了负优化。下面逐一分析这三个负载的性能特性。

1. main¶

通过 perf 观察性能瓶颈，这几个函数耗费的时间占比较多：

sqlite3BtreeMovetoUnpacked(BtCursor *pCur, UnpackedRecord *pIdxKey, i64 intKey, int biasRight, int *pRes) 来自 src/sqlite3.c：24.66%，在 Btree 上进行搜索，根据 key，查找对应的 entry，中间一个比较耗时的部分是逐字节扫描 pCell 指向的内存，此外还会经常调用 sqlite3GetVarint 获取 pCell 保存的变长 int 来实现二分搜索；
sqlite3VdbeExec(Vdbe *p) 来自 src/sqlite3.c：22.36%，用 Loop+Switch 实现的执行字节码的虚拟机，执行编译好的 SQL 语句，VDBE 是 SQLite 的执行引擎，全称是 Virtual Database Engine，模拟过程会维护一个 pc，从 aOp 数组里扫描字节码，每个字节码是一个 struct VdbeOp 结构体，根据它的 opcode 字段进行一个大的 switch-case，一共有 176 种不同的 Op；gcc 把这个巨大的 switch-case 编译成了跳转表，也就是把各个 case 的地址保存到一个数组当中，根据 opcode 计算出对应 case 的地址，再 jmp *%rax 过去，执行完 case 的代码后，再跳回 switch 开头，读取下一个 opcode，再跳转；目前有一些解释器会直接用 C 的扩展，用 computed goto label 的写法来帮助编译器做这个优化，或者更进一步直接在每个 case 的最后跳转到下一个 opcode 对应的 case，拓展阅读： Android Runtime 解释器的实现探究；
pcache1Fetch(sqlite3_pcache *p, unsigned int iKey, int createFlag) 来自 src/sqlite3.c：8.26%，对应一个用哈希表维护的 Page Cache，用于在内存里缓存硬盘上的数据，主要瓶颈在 pcache1FetchNoMutex 里的 pPage = pCache->apHash[iKey % pCache->nHash]; while( pPage && pPage->iKey!=iKey ){ pPage = pPage->pNext; }，对哈希表的桶里的链表做一个扫描，随机访存比较多；
sqlite3GetVarint(const unsigned char *p, u64 *v) 来自 src/sqlite3.c：3.70%，恢复内存中可变长度的整数，比如 [0,127] 范围的数字用一个字节保存，[128,16383] 范围的数字用两个字节保存，更大的数字则要更长，最多到九个字节，这种压缩表示还挺常见的，多数时候可以节省空间。

都是一些比较经典的数据结构和算法的应用，Btree，Loop+Switch 的解释执行，加哈希表查询。一段 Vdbe 指令序列的例子如下：

sqlite> CREATE TABLE test(key INT, value INT); sqlite> EXPLAIN SELECT * FROM test WHERE key = 1; addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 10 0 0 Start at 10 1 OpenRead 0 2 0 2 0 root=2 iDb=0; test 2 Rewind 0 9 0 0 3 Column 0 0 1 0 r[1]= cursor 0 column 0 4 Ne 2 8 1 BINARY-8 84 if r[1]!=r[2] goto 8 5 Column 0 0 3 0 r[3]= cursor 0 column 0 6 Column 0 1 4 0 r[4]= cursor 0 column 1 7 ResultRow 3 2 0 0 output=r[3..4] 8 Next 0 3 0 1 9 Halt 0 0 0 0 10 Transaction 0 0 1 0 1 usesStmtJournal=0 11 Integer 1 2 0 0 r[2]=1 12 Goto 0 1 0 0

能看到它的实现方式是，扫描 test 表的每一行，读取 key 列，如果不等于 1，则直接进入下一行；如果等于 1，则把所有列读出来，加入到结果当中。

这个负载的主要瓶颈在内存上。执行了 896.3B 条指令，其中 252.4B 是 Load 指令，105.1B 是 Store 指令，178.0B 是分支指令，错误预测了 1.5B 次，MPKI 是 1.5B/896.3B*1000=1.67。

2. cte¶

通过 perf 观察性能瓶颈，这几个函数耗费的时间占比较多：

sqlite3VdbeExec(Vdbe *p) 来自 src/sqlite3.c：41.15%，主要时间花费在查询的执行，因为这个 cte 负载，其计算过程比较复杂，用 SQL 实现了数独（递归和非递归版本）、Mandelbrot，还测试了 EXCEPT SELECT 语法；
sqlite3VdbeRecordCompareWithSkip(int nKey1, const void *pKey1, UnpackedRecord *pPKey2, int bSkip) 来自 src/sqlite3.c：7.37%，比较表里的两个行，会调用 sqlite3VdbeSerialGet 获取行内的数据，再根据数据类型进行对应的比较；
sqlite3VdbeSerialGet(const unsigned char *buf, u32 serial_type, Mem *pMem) 来自 src/sqlite3.c：5.95%，反序列化，根据内存中保存的数据类型，解析对应的数据，比如整数或者浮点，它的 switch-case 也被 GCC 编译成了跳转表；
vdbeSorterSort(SortSubtask *pTask, SorterList *pList) 来自 src/sqlite3.c：5.95%，实现归并排序，主要时间是在通过函数指针调用比较器函数，以及根据比较结果进行归并。

瓶颈主要在解释器上，与 CPython 解释器的行为模式类似。执行了 306.0B 条指令，其中 82.8B 是 Load 指令，39.6B 是 Store 指令，62.6B 是分支指令，错误预测了 40.9M 次，MPKI 是 40.9M/306.0B*1000=0.13，处于很低的水平。

3. fp¶

通过 perf 观察性能瓶颈，这几个函数耗费的时间占比较多：

sqlite3VdbeExec(Vdbe *p) 来自 src/sqlite3.c：30.66%，主要时间花费在查询的执行，因为这个 fp 负载，其计算过程引入了不少浮点运算；
sqlite3AtoF(const char *z, double *pResult, int length, u8 enc) 来自 src/sqlite3.c：19.18%，实现从字符串到浮点数的转换，因为 SQL 内有很多浮点字面量；
vdbeSorterSort(SortSubtask *pTask, SorterList *pList) 来自 src/sqlite3.c：10.44%，描述见上；
sqlite3VdbeRecordCompareWithSkip(int nKey1, const void *pKey1, UnpackedRecord *pPKey2, int bSkip) 来自 src/sqlite3.c：6.76%，描述见上。

瓶颈主要在解释器上，不过因为 SQL 语句的设计，有很多时间花在字符串转浮点数上。执行了 554.7B 条指令，其中 132.3B 是 Load 指令，61.3B 是 Store 指令，111.5B 是分支指令，错误预测了 392.6M 次，MPKI 是 392.6M/554.7B*1000=0.71。

小结¶

各负载在不同编译选项下的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	MPKI
1. main	GCC 14 `-O3`	69	896.3	252.4	105.1	178.0	1.67
1. main	GCC 14 `-O3 -march=native`	73	905.3	273.7	109.9	177.2	1.62
2. cte	GCC 14 `-O3`	12	306.0	82.8	39.6	62.6	0.13
2. cte	GCC 14 `-O3 -march=native`	13	303.6	88.9	40.0	62.6	0.13
3. fp	GCC 14 `-O3`	25	554.7	132.3	61.3	111.5	0.71
3. fp	GCC 14 `-O3 -march=native`	27	555.8	142.7	62.6	111.6	0.69

通过上面的分析，可见 sqlite_r 确实是比较难优化的那一类，大量访存、计算和分支混合在一起，对内存子系统的负担比较重，难以向量化，开 -O3 -march=native 后运行时间从 106s 增加到 113s，产生了负优化。整体来看，执行了 1760B 条指令，其中有 353B 条是分支指令，MPKI 仅有 1.08，主要由 main 贡献。

710.omnetpp_r¶

SPEC INT 2017 就有的老面孔 520.omnetpp_r，不过运行的内容也和以往不同。520.omnetpp_r 做的是 10 Gbps 网络的模拟，而 710.omnetpp_r 有足足十项负载，负载的多样性有了明显的增强。十项负载的命令行参数如下：

omnetpp_r -f randomMesh.ini -c General omnetpp_r -f queuenet.ini -c OneFifo omnetpp_r -f queuenet.ini -c TandemFifos omnetpp_r -f queuenet.ini -c SmallCQN omnetpp_r -f queuenet.ini -c Ring omnetpp_r -f queuenet.ini -c Terminal omnetpp_r -f queuenet.ini -c CallCenter omnetpp_r -f queuenet.ini -c ForkJoin omnetpp_r -f queuenet.ini -c ResourceAllocation omnetpp_r -f queuenet.ini -c AllocDealloc

实测数据显示，十个负载耗费的时间分别是 24.6s、7.8s、3.8s、4.6s、9.1s、3.7s、2.6s、9.4s、6.6s 和 14.0s，共计 86.2s。reftime 是 486s，对应 5.6 分。

1. randomMesh¶

首先分析第一个负载的热点函数：

omnetpp::cTopology::calculateUnweightedSingleShortestPathsTo(Node *_target) 来自 src/simulator/sim/ctopology.c：16.22%，实现了经典的单源最短路算法，且由于每条边的权重都是一，实际上就是 BFS，主要瓶颈来自于随机访存和计算距离的双精度浮点运算；
__do_dyncast 和 __dynamic_cast 来自 libstdc++.so：4.73%+3.24%+2.22%+0.81%=11.0%，代码中有一些 dynamic_cast 的使用，如 Routing::handleMessage；
Routing::handleMessage(cMessage *msg) 来自 src/model/Routing.cc：7.10%，模拟路由表的功能，主要逻辑是内联了一个 std::map<int, int> 的 find 操作（Godbolt），在一个红黑树上进行查询，读取结点，比较 key，走左子树或右子树继续查询；
cEvent::shouldPrecede(const cEvent *other) 来自 src/simulator/sim/cevent.cc：4.64%，一个 cEvent 结构体的多关键字比较函数。

整体来看，它的瓶颈分散在比较多的地方。执行了 306.4B 条指令，其中有 98.7B 条 Load 指令，50.2B 条 Store 指令，62.1B 条分支指令，错误预测 661.2M 次，MPKI 为 661.2M/306.4B*1000=2.16。开 -O3 -flto 后，指令数减少到 284.6B，其中有 91.3B 条 Load 指令，45.4B 条 Store 指令，55.7B 条分支指令。进一步开 -O3 -flto -ljemalloc，指令数进一步减少到 279.8B，其中有 90.3B 条 Load 指令，44.4B 条 Store 指令，54.3B 条分支指令。

randomMesh 在不同编译选项下的情况如下：

编译器 + 选项	指令 (B)	Load (B)	Store (B)	分支 (B)
GCC 14 `-O3`	306.4	98.7	50.2	62.1
GCC 14 `-O3 -flto`	284.6	91.3	45.4	55.7
GCC 14 `-O3 -flto -ljemalloc`	279.8	90.3	44.4	54.3

其余的 2-10 共 9 个 queuenet 负载¶

用 perf 观察，其余 9 个 queuenet 负载的瓶颈主要集中在这些函数：

strcmp（__strcmp_avx2）
dynamic_cast（__do_dyncast 和 __dynamic_cast）
malloc、free 和 operator new
printf（__printf_buffer）

还有些 omnetpp 自己的函数（如 omnetpp::common::StringPool::obtain(const char *s)，主要是对 std::unordered_map<const char *,int,str_hash, str_eq> pool 进行查询和修改操作），散落各处，每个函数都只占用不到 5% 的时间。对于这么大比例使用 libc/libstdc++ 中函数的情况，标准库和内存分配器的实现就很重要了。

小结¶

基于以上分析，尝试了不同的编译选项，结果如下：

开 -O3 -ljemalloc 后，十个负载的性能都有了一定的提升，总时间从 86.2s 降低到 80.6s，分数从 5.6 分提升到 6.0 分。
开 -O3 -flto 也能带来不错的提升，总时间从 86.2s 降低到 76.1s，分数从 5.6 分提升到 6.4 分。
开 -O3 -flto -ljemalloc，则总时间从 86.2s 降低到 69.7s，分数从 5.6 分提升到 7.0 分。

类似现象在 SPEC INT 2017 中就曾出现，-O3 -flto 比 -O3 快 3%，-O3 -flto -ljemalloc 比 -O3 -flto 快 20%。

-O3 下，执行的指令数是 1447B，其中 291B 是分支指令，MPKI 是 0.78。虽然 randomMesh 因为图计算，MPKI 比较高，但整体的 MPKI 被其余负载拉低了。相比之下，SPEC INT 2017 Rate 的 520.omnetpp_r 的 MPKI 足足有 4.33。虽然还是同一个框架，但是负载行为还是出现了明显的变化。

714.cpython_r¶

前面才提到过解释器，这就到 CPython 了。该基准测试包含三个负载：

# 1. resnet cpython_r -I -B coreml_pb.py -i 2 -a -m Resnet50Headless.mlmodel -d 10 # 2. mobilenet cpython_r -I -B coreml_pb.py -i 5 -a -c -m MobileNetV2.mlmodel -d 20 # 3. dna cpython_r -I -B dna_bench.py 600000

三个负载的运行时间分别为 31s、20s 和 20s，总时间 71s，reftime 是 479s，对应 6.7 分。开启 -O3 -flto 后，三个负载的运行时间分别为 29s、19s 和 18s，总时间 66s，对应 7.3 分。-O3 -ljemalloc 影响很小，-O3 -march=native 有负优化。下面具体分析三个负载的负载特性。

1. resnet¶

还是用 perf，统计出热点函数：

_PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int throwflag) 来自 src/cpython/Python/ceval.c：24.09%，解释器中的 Loop + Switch 核心代码，对 Python 字节码进行解释执行，主要的瓶颈也是跳转表，根据 opcode 计算 case 地址然后 jmp *%rax；
PyUnicode_FromFormatV(const char *format, va_list vargs) 来自 src/cpython/Objects/unicodeobject.c，4.51%，把结果写到 Python 字符串的 sprintf 版本，主要瓶颈是格式化字符串的解析，找 % 的位置；
_PyObject_Free(void *ctx, void *p) 来自 src/cpython/Objects/obmalloc.c：3.48%，释放 PyObject，Python 有一个自己的针对 PyObject 的内存分配器，而不是直接使用 malloc/free；
_PyObject_Malloc(void *ctx, size_t nbytes) 来自 src/cpython/Objects/obmalloc.c：3.15%，分配 PyObject。

剩下就比较零散了，主要还是围绕着解释器的循环。执行了 651.6B 条指令，其中有 180.4B 是 Load 指令，104.1B 是 Store 指令，136.6B 是分支指令，错误预测仅 7.9M 次，MPKI 等于 7.9M/651.6B*1000=0.01，可以忽略不计。开启 -O3 -flto 后，热点函数不变，指令数降低为 618.0B，其中 Load 有 176.6B，Store 有 93.9B，分支有 128.6B，错误预测 48.6M 次。

2. mobilenet¶

统计出热点函数，发现前四依然是上面四个，且时间占比差不多。可能是因为，resnet 和 mobilenet 负载用的是同一个 .py 源码，只是用的模型不同。执行了 438.9B 条指令，其中有 121.4B 是 Load 指令，70.5B 是 Store 指令，91.6B 是分支指令，错误预测 9.1M 次，MPKI 等于 9.1M/438.9B*1000=0.02，可以忽略不计。开启 -O3 -flto 后，热点函数不变，指令数降低为 416.4B，其中 Load 指令有 119.0B，Store 指令有 63.8B，分支有 86.2B，错误预测 35.0M 次。

3. dna¶

统计热点函数：

_PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int throwflag) 来自 src/cpython/Python/ceval.c：36.75%，描述见上；
_PyObject_Free(void *ctx, void *p) 来自 src/cpython/Objects/obmalloc.c：5.31%，描述见上；
PyUnicode_Contains(PyObject *str, PyObject *substr) 来自 src/cpython/Objects/unicodeobject.c，4.59%，Python 字符串的 contains 操作，对应 data/all/input/knucleotide.py 代码中的 chat in "GATC" 判断；
_PyObject_Malloc(void *ctx, size_t nbytes) 来自 src/cpython/Objects/obmalloc.c：3.52%，描述见上。

主要热点还是解释执行，不过因为字符串的 contains 调用次数较多，所以 PyUnicode_Contains 时间占比有所上升。执行了 394.9B 条指令，其中有 113.3B 是 Load 指令，62.1B 是 Store 指令，77.1B 是分支指令，错误预测 228.1M 次，MPKI 等于 228M/394B*1000=0.58，也还是很低。开启 -O3 -flto 后，热点函数不变，指令数降低为 379.3B，其中 Load 有 113.4B，Store 有 58.5B，分支有 71.6B，错误预测 223.8M 次。

小结¶

各负载在不同编译选项下的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)
1. resnet	GCC 14 `-O3`	31	651.6	180.4	104.1	136.6	7.9
1. resnet	GCC 14 `-O3 -flto`	29	618.0	176.6	93.9	128.6	48.6
2. mobilenet	GCC 14 `-O3`	20	438.9	121.4	70.5	91.6	9.1
2. mobilenet	GCC 14 `-O3 -flto`	19	416.4	119.0	63.8	86.2	35.0
3. dna	GCC 14 `-O3`	20	394.9	113.3	62.1	77.1	228.1
3. dna	GCC 14 `-O3 -flto`	18	379.3	113.4	58.5	71.6	223.8

714.cpython_r 就是一个典型的基于字节码的解释器，在一个 Loop + Switch 结构当中完成解释执行。整体 MPKI 很低，只有 0.17，即使开了 -O3 -flto，虽然预测错误多了，总指令数少了，MPKI 会变大，但绝对数字也还是很小，只有 0.23。

721.gcc_r¶

SPEC INT 2017 中的 502.gcc_r 便已存在，当时基于 GCC 4.5.0，针对 gcc-pp.c、gcc-smaller.c 和 ref32.c 进行五次编译，这次 721.gcc_r 对着三个同名文件（其中 gcc-pp.c 内容更新了，其余两个不变）分别进行一次编译，基于 GCC 11.2.0 版本，命令行参数如下，相比 502.gcc_r 有所简化：

# 1. gcc-pp cc1_r gcc-pp.c -O2 -fpic -o gcc-pp.c.opts-O2_-fpic.s # 2. gcc-smaller cc1_r gcc-smaller.c -O3 -fipa-pta -o gcc-smaller.c.opts-O3_-fipa-pta.s # 3. ref32 cc1_r ref32.c -O3 -finline-limit=12000 -fno-tree-vrp -o ref32.c.opts-O3_-finline-limit_12000_-fno-tree-vrp.s

-O3 运行时间分别为 44s、21s 和 51s，总时间 116s，reftime 是 686s，对应 5.9 分。开了 -O3 -flto 后，时间略微降低到 115s，开 -O3 -flto -ljemalloc 后时间进一步降低到 111s，主要针对的是占用时间约 2% 的 malloc/free。开 -march=native 对性能几乎没有影响。

与 502.gcc_r 的行为类似（见 The Alberta Workloads for the SPEC CPU® 2017 Benchmark Suite 的分析），721.gcc_r 的时间分布在大量函数，除了 ref32 花费了 10.76% 的时间在 dominated_by_p、5.92% 的时间在 bitmap_set_bit 以外，其他函数的占用时间基本都在 3% 以下，没有一个特别明显的热点函数。

其中 bitmap_set_bit(bitmap head, int bit) 函数来自 src/gcc/bitmap.cc，通过位运算，在 bitmap 里把一个 bit 设为一，比较特别的是，这个 bitmap 可以有二叉树（splay tree）和链表两种保存格式。从 perf record -e branch-misses:pp 来看，这个函数主要是在设置 bit 的时候出现了一些分支预测的错误：它首先读取 bitmap 原来的数值，判断该 bit 是否已经设置，只有之前没设置的情况下，才会更新 bitmap。这样的好处是，可以节省一些 Store 指令，但也带来了一些分支的错误预测。此外就是链表的插入逻辑，需要判断指针是否为空。

另外，dominated_by_p(enum cdi_direction dir, const_basic_block bb1, const_basic_block bb2) 函数来自 src/gcc/dominance.cc，做的是基本块的 dominance 查询，A dom B 代表从函数入口到 B 一定会经过 A，这是编译器中很常见的一个查询，由于查询次数很多，会预先通过两遍 dfs（一遍从上往下，一遍从下往上，上对应入口，下对应出口）找到基本块的拓扑顺序，然后根据拓扑排序的结果来判断是否有 A dom B 的关系：DFS_Number_In(A) <= DFS_Number_In(B) && DFS_Number_Out(A) >= DFS_Number_Out(B)，也就是从上往下遍历（In）的时候，先到达 A，然后从下往上遍历（Out）的时候，先到达 B。其实这个函数并不复杂，而且 DFS 已经提前算好了，这里只需要读取计算好的结果，但是因为它把两次比较做成了一次 cmp+jl 和一次 cmp+setle，导致容易出现分支预测错误。从逻辑上来说，这里可以改成完成两次比较，再对结果取 AND，但由于代码里是 && 有短路的性质，理论上第一个条件成立了，就不该进行第二个条件，更何况第二个条件里还涉及两次访存。这种实现确实可能省下一些访存，但分支预测也变难了。如果改写代码，先进行两次比较，再进行 && 操作，就没有分支指令了，不过访存次数也确实变多了：Godbolt。

三次运行的性能计数器如下：

gcc-pp: 执行 470.2B 条指令，其中有 125.6B 条 Load 指令，58.8B 条 Store 指令，99.9B 条分支指令，错误预测 2.2B 次，MPKI 等于 2.2B/470.2B*1000=4.68
gcc-smaller: 执行 243.4B 条指令，其中有 65.0B 条 Load 指令，30.3B 条 Store 指令，51.8B 条分支指令，错误预测 0.91B 次，MPKI 等于 0.91B/243.4B*1000=3.74
ref32: 执行 403.7B 条指令，其中有 118.9B 条 Load 指令，45.8B 条 Store 指令，86.1B 条分支指令，错误预测 0.61B 次，MPKI 等于 0.61B/403.7B*1000=1.51

各负载的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (B)	MPKI
1. gcc-pp	GCC 14 `-O3`	44	470.2	125.6	58.8	99.9	2.2	4.68
1. gcc-pp	GCC 14 `-O3 -ljemalloc`	42	467.2	125.2	58.7	98.5	2.2	4.71
2. gcc-smaller	GCC 14 `-O3`	21	243.2	65.0	30.3	51.8	0.91	3.74
2. gcc-smaller	GCC 14 `-O3 -ljemalloc`	21	242.1	64.7	30.2	51.2	0.90	3.72
3. ref32	GCC 14 `-O3`	51	403.8	118.9	45.8	86.1	0.61	1.51
3. ref32	GCC 14 `-O3 -ljemalloc`	49	405.2	119.4	46.2	85.8	0.61	1.51

整体指令数是 1120B，其中有 238B 条分支指令，MPKI 等于 3.37，在 SPEC INT 2026 中属于比较高的了。作为对比，SPEC INT 2017 Rate 中 502.gcc_r 的 MPKI 是 3.13，两者差异不大。

意料之中的是，用 GCC 14 编译的 721.gcc_r，运行得比用 LLVM 22 编译的 721.gcc_r 更快。

723.llvm_r¶

随着 LLVM 的发展，SPEC CPU 2026 终于是把 LLVM 也加入了进来。和 721.gcc_r 类似，也是跑 LLVM 的优化器，只不过输入直接就是 .bc 中间代码文件，而不是 C 代码。它包括两个负载：

# 1. transformsplus llvm-opt_r transformsplus.bc -S -O3 -mcpu=pwr9 # 2. codegen llvm-opt_r codegen.bc -S -O3 -mcpu=pwr9

-O3 运行时间分别为 62s 和 53s，总时间 115s，reftime 是 507s，对应 4.4 分。开 -O3 -flto 性能反而变差，不过开 -O3 -ljemalloc 有明显性能提升，运行时间降低为 59s 和 47s，总时间 106s，分数提高到 4.8 分。开 -march=native 对性能几乎没有影响。

有意思的是，用 GCC 14 编译的 723.llvm_r 比用 LLVM 22 编译的运行更快，不过优势并不大。下面针对这两个负载进行具体的分析。

1. transformsplus¶

使用 perf 观察热点函数：

llvm::InstCombinerImpl::foldIntegerTypedPHI(llvm::PHINode& PN) 来自 src/lib/Transforms/InstCombine/InstCombinePHI.cpp: 4.06%，对 IR 中的 PHI 结点进行处理，这个函数还挺复杂的，主要瓶颈在内层循环，遍历 use 链表，有比较多的随机访存和通过分支来判断 LLVM 自制 RTTI 的类型；
_int_malloc/cfree/malloc：2.38%+0.89%+0.82%=4.09%，大量的内存分配和释放，因此 -ljemalloc 能带来不错的性能提升；
llvm::DenseMapBase::FindAndConstruct(): 1.69%，LLVM 自己用数组实现的哈希表，主要瓶颈在读取哈希桶内的 entry 并比较 key，随机访存比较慢，近期 LLVM 也在做相关的优化。

其他有很多小的函数，占时间比例不高，和 721.gcc_r 类似，也是时间分散得比较开。执行指令数为 572.8B，其中 Load 指令有 137.7B，Store 指令有 78.6B，分支指令有 118.7B，错误预测有 3.5B 次，MPKI 等于 3.5B/572.8B*1000=6.11，挺高的。

从 perf record -e branch-misses:pp 来看，错误预测挺分散在很多个函数，每个函数比例也不高。从 Top down 来看，有 40% 都在 Frontend Bound，有 19.2% 在 Bad Speculation。更进一步分析，发现它的 L1 ICache 缺失次数为 12.6B（L1-icache-load-misses 性能计数器），对应的 L1IC MPKI 足足有 12.6B/572.8B*1000=22.0，可见主要问题是 723.llvm_r 的代码量太大了，L1IC 存不下，BTB 也够呛。

2. codegen¶

使用 perf 观察热点函数：

llvm::InstCombinerImpl::foldIntegerTypedPHI(llvm::PHINode& PN) 来自 src/lib/Transforms/InstCombine/InstCombinePHI.cpp: 20.85%，描述见上；
_int_malloc/cfree/malloc：1.91%+0.72%+0.65%=3.28%，描述见上；
llvm::DenseMapBase::FindAndConstruct(): 1.29%，描述见上。

整体的情况和 transformsplus 类似，只不过 foldIntegerTypedPHI 时间占比更高，其他还是有很多函数耗费很短的时间，分散得比较开。执行指令数为 415.9B，其中 Load 指令有 100.4B，Store 指令有 57.5B，分支指令有 86.0B，错误预测有 2.4B 次，MPKI 等于 2.4B/415.9B*1000=5.77，依然很高。

小结¶

各负载的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (B)	MPKI
1. transformsplus	GCC 14 `-O3`	62	572.8	137.7	78.6	118.7	3.5	6.11
1. transformsplus	GCC 14 `-O3 -ljemalloc`	59	563.2	135.7	77.2	115.2	3.3	5.86
2. codegen	GCC 14 `-O3`	53	415.9	100.4	57.5	86.0	2.4	5.77
2. codegen	GCC 14 `-O3 -ljemalloc`	47	411.0	99.3	56.6	84.1	2.3	5.60

LLVM 和 GCC 同为编译器领域的双子星，在负载特性上也有相似之处：有很多的内存分配和释放，受益于 -ljemalloc；时间分布在大量小函数当中，热点不明显；MPKI 较高，尤其是 723.llvm_r 直接一跃成为 SPEC INT 2026 Rate 中 MPKI 最高的一个基准测试，可能是因为它有大量数据依赖的分支。723.llvm_r 整体的指令数有 991B，其中有 205B 是分支指令，MPKI 达到 5.98，即使放在 SPEC INT 2017 Rate 里，也能紧跟在 505.mcf_r 和 541.leela_r 两位大哥身后，成为 MPKI 第三高的项目。

727.cppcheck_r¶

cppcheck 是一个 cpp 静态分析工具，输入 C++ 文件，提供代码的分析报告，汇报数组越界访问或变量未初始化等等问题。它会分析三个不同的代码，根据命名看，应该是从其他基准测试里找的。747.dealii（成为了 766.femflow_r 的一部分）和 770.7z 不在 SPEC CPU 2026 当中，应该没被选上，只有 738 diamond 以 838.diamond_s 保留了下来：

# 1. 738_diamond cppcheck_r --force 738-diamond-record.cpp --checkers-report=738_report.txt --enable=all --output-file=738_bogey.txt # 2. 747_dealii cppcheck_r --force 747-dealii-data_out_base.cc --checkers-report=747_report.txt --enable=all --output-file=747_bogey.txt # 3. 770_7z cppcheck_r --force 770-7z-SystemPage.cpp --checkers-report=770_report.txt --output-file=770_bogey.txt

三条指令的运行时间分别为 27s、22s 和 33s，共 82s，reftime 是 359s，对应 4.4 分。开 -O3 -flto 或 -O3 -march=native 仅能略微提升 1% 的性能，但 -O3 -ljemalloc 能显著提升性能，运行时间缩短到 24s、18s 和 29s，总时间 71s，对应 5.1 分。

下面对这三个负载进行深入的分析。

1. 738_diamond¶

热点函数如下：

multiCompareImpl(const Token *tok, const char *haystack, nonneg int varid) 来自 src/lib/token.cpp：40.82%，字符串匹配函数，比如用 abc|def 去匹配一个 token，逐字符比较 token 和 haystack，匹配不上时跳到下一个 | 尝试 haystack 的下一个候选模式；
Token::Match(const Token *tok, const char pattern[], nonneg int varid) 来自 src/lib/token.cpp：12.08%，也是类似的字符串匹配函数，语法有些不同，类似自研正则表达式子集，它会调用上面的 multiCompareImpl 函数来做部分匹配；
ScopeInfo3::findScope(const std::string & scope) 来自 src/lib/tokenize.cpp：5.49%，循环，从当前作用域开始寻找对应的符号，如果没有，则检查更高一级的作用域，一般用于从变量名找到作用域里定义的符号，主要时间花在对 std::list 的遍历以及 std::string 的比较；
Tokenizer::simplifyUsing()：3.57%，把 using N::x; 变为 using x = N::x，里面就会用到上面说的 Token::Match，参数如 "using ::| %name% ::"，来做一些模式的匹配并进行相应的简化；
cfree/malloc/_int_malloc：0.47%+0.33%+0.45%=1.25%，内存分配相关。

可以看到，主要瓶颈在字符串匹配上，它的实现就是一个循环，用指针去扫描字符串，没有做数据结构上的优化。执行了 399.9B 条指令，其中有 81.2B 条 Load 指令，35.5B 条 Store 指令，108.9B 条分支指令，错误预测 173.2M 次，MPKI 等于 173M/399.9B*1000=0.43，不算高。

2. 747_dealii¶

热点函数类似：

multiCompareImpl(const Token *tok, const char *haystack, nonneg int varid) 来自 src/lib/token.cpp：27.42%，描述见上；
Token::Match(const Token *tok, const char pattern[], nonneg int varid) 来自 src/lib/token.cpp：14.55%，描述见上；
cfree/malloc/_int_malloc：2.14%+1.57%+0.53%=4.24%，内存分配的比例更高；
Token::simpleMatch(const Token *tok, const char pattern[], size_t pattern_len) 来自 src/lib/token.cpp：3.88%，又一个字符串匹配函数，换了种格式，比如 "abc def" 代表匹配 abc 或 def，这次的瓶颈是 strncmp 和 memchr；
TemplateSimplifier::addInstantiation(Token *token, const std::string &scope) 来自 src/lib/templatesimplifier.cpp：2.98%，在 token 级别上做一些代码简化的变换，主要的耗时在对 std::list 的遍历；
isAliasOf(const Token* tok, const Token* expr, int* indirect, bool* inconclusive) 来自 src/lib/astutils.cpp：2.55%，判断两个变量是否 alias。

依然有大量的字符串匹配，不太理解为何要设计多种语法，并分别实现多个字符串匹配函数。执行了 303.9B 条指令，其中有 67.3B 条 Load 指令，31.5B 条 Store 指令，82.5B 条分支指令，错误预测 298.9M 次，MPKI 等于 298.9M/303.9B*1000=0.98，也不算高。

3. 770_7z¶

热点如下：

multiCompareImpl(const Token *tok, const char *haystack, nonneg int varid) 来自 src/lib/token.cpp：32.25%，描述见上；
Token::Match(const Token *tok, const char pattern[], nonneg int varid) 来自 src/lib/token.cpp：18.82%，描述见上；
__memcmp_avx2_movbe：8.99%，被用于字符串匹配；
std::map<std::string>::equal_range：7.34%，红黑树上的查询，外加字符串匹配；
__strchr_avx2：7.34%，被用于字符串匹配；
cfree/malloc/_int_malloc：0.37%+0.27%+0.17%=0.81%，这次内存分配的比例较低。

依然是字符串匹配为主。执行了 505.2B 条指令，其中有 111.0B 条 Load 指令，43.8B 条 Store 指令，137.5B 条分支指令，错误预测 421.0M 次，MPKI 等于 421M/505.2B*1000=0.83，也不算高。

小结¶

整体看下来，727.cppcheck_r 就是在不断地做字符串匹配。一个值得思考的问题是，为何不直接通过 tokenizer 将 token 转为数字，这样比较起来快得多。在 token 级别上做各种变换，就在不停地对 token 进行字符串比较，导致最后的性能瓶颈，不是在 cppcheck 自己写的字符串比较，就是在 libc 的字符串比较里了。

各负载的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)	MPKI
1. 738_diamond	GCC 14 `-O3`	27	399.9	81.2	35.5	108.9	173.2	0.43
1. 738_diamond	GCC 14 `-O3 -ljemalloc`	24	395.0	80.2	34.7	107.5	171.8	0.43
2. 747_dealii	GCC 14 `-O3`	22	303.9	67.3	31.5	82.5	298.9	0.98
2. 747_dealii	GCC 14 `-O3 -ljemalloc`	18	291.0	64.5	29.2	79.0	287.3	0.99
3. 770_7z	GCC 14 `-O3`	33	505.2	111.0	43.8	137.5	421.0	0.83
3. 770_7z	GCC 14 `-O3 -ljemalloc`	29	501.5	110.1	43.2	136.6	409.8	0.82

整体执行了 1211B 指令，其中有 329B 分支指令，分支指令的比例足足有 27%，傲视 SPEC INT 2026 Rate 全场，这都是拜字符串匹配所赐，读一点就比较一点。但同时，MPKI 仅为 0.71，在 SPEC INT 2026 Rate 中倒数第三，仅高于 714.cpython_r 的 0.17 和 750.sealcrypto_r 的 0.14，说明大部分字符串匹配的结果都是很好预测的，比如比较到第一个字节就对不上了。

729.abc_r¶

之前第一次看到 abc 还是在 yosys，它是一个 EDA 软件，和后面的 734.vpr_r 都是开源 EDA 工具里的重量级人物，分别实现了逻辑综合以及布局布线。该基准测试包括 6 个负载：

# 1. twoexact ./abc_r -F twoexact.in # 2. beem6 ./abc_r -F beem6-fraig.in # 3. mem ./abc_r -F mem_ctrl.in # 4. vga ./abc_r -F vga_lcd_miter.in # 5. mcml ./abc_r -F mcml.in # 6. des ./abc_r -F des_system90.in

六个负载运行时间都不长，分别是 6.3s、10.1s、13.5s、32.3s、13.6s 和 17.0s，总时间 92.8s，reftime 是 459s，对应 4.9 分。

开 -flto、-march=native 或 -ljemalloc 都没有什么提升，性能差距在 1% 以内，属于是油盐不进，各种优化都难以生效。下面进行具体热点分析。

1. twoexact¶

主要的热点函数：

sat_solver_propagate(sat_solver* s) 来自 src/berkeley-abc/src/sat/bsat/satSolver.c：75.33%，应该是 SAT Solver 中的 Unit Propagation，寻找那些只剩下一个变量还没确定的语句，给它进行赋值，然后传播到其他语句；
sat_solver_analyze(sat_solver* s, int h, veci* learnt) 来自 src/berkeley-abc/src/sat/bsat/satSolver：15.85%，应该是针对出现冲突的语句进行分析，属于 CDCL（Conflict Driven Clause Learning）的一部分；
sat_solver_solve_internal(sat_solver* s) 来自 src/berkeley-abc/src/sat/bsat/satSolver.c：3.80%，是 SAT Solver 的入口函数。

很少能见到这种瓶颈如此高度集中的情况了，不过确实，SAT Solver 大部分时间都在做 Unit Propagation，出现冲突了就做 CDCL。唤起了很久以前在《软件分析与验证》课上写 DPLL SAT Solver 的回忆，当然了，abc 的实现肯定比我那课程作业要更加复杂和高级。主要的瓶颈就是一堆访存以及依赖内存结果的分支，在 SAT 问题的解空间内进行搜索。

指令数 53.2B，其中 Load 指令 13.8B，Store 指令 3.2B，分支指令 8.4B，错误预测 606.2M，MPKI 等于 606.2M/53.2B*1000=11.39，非常的高，接近 SPEC INT 2017 的 541.leela_r 大帝。

通过 perf record -e branch-misses:pp，可以看到主要的分支预测错误来自 sat_solver_propagate 的几处变量取值的判断逻辑，都是依赖数据的分支，难以预测。

2. beem6¶

主要的热点函数：

Cec4_ManPackAddPatterns(Gia_Man_t * p, int iBit, Vec_Int_t * vLits) 来自 src/berkeley-abc/src/proof/cec/cecSatG2.c：54.65%，CEC 指的是 Combinational Equivalence Checking，该函数内层循环遍历 vLits 中的每个 Entry，通过位运算按一定条件更新 p->vSims；
Cec4_ManGeneratePatterns_rec(Gia_Man_t * p, Gia_Obj_t * pObj, int Value, Vec_Int_t * vPat, Vec_Int_t * vVisit) 来自 src/berkeley-abc/src/proof/cec/cecSatG2.c：29.01%，根据 pObj 的类型进行分类讨论和递归。

热点依然很集中，不过因为缺少领域知识，不太明白它在跑什么。运行 255.5B 条指令，其中 Load 有 57.2B，Store 有 7.3B，分支有 40.3B，错误预测 192.0M 次，MPKI 等于 192.0M/255.5B*1000=0.75，相比 SAT 来说低了很多。

3. mem¶

热点函数依然是 sat solver 相关，相比 twoexact，sat_solver_canceluntil 时间占比高了一些，达到了 8.46%，不过整体的特性基本是一样的。运行 151.0B 条指令，其中 Load 指令有 43.4B，Store 指令有 15.4B，分支有 24.2B，错误预测 1213.7M，MPKI 等于 1213.7M/151.0B*1000=8.03，非常高。

4. vga¶

热点函数依然是 sat solver，整体特性一致。运行 490.0B 条指令，Load 指令有 143.9B，Store 指令有 54.4B，分支有 76.9B，错误预测 2092.8M 次，MPKI 等于 2092.8M/490B*1000=4.27，还是很高。

5. mcml¶

热点函数终于有了新面孔：

Abc_ObjDeleteFanin(Abc_Obj_t * pObj, Abc_Obj_t * pFanin) 来自 src/berkeley-abc/src/base/abc/abcFanio.c：12.57%，逻辑很简单，就是调用 Vec_IntRemove 从数组里删除一个元素，遍历数组，找到匹配的元素，把后面的元素都往前挪，这个遍历匹配逻辑是主要的瓶颈，其次就是移动数据；
Gia_ManSwiSimulate(Gia_Man_t * pAig, Gia_ParSwi_t * pPars) 来自 src/berkeley-abc/src/aig/gia/giaSwitch.c：8.87%，实现模拟过程，很大一部分时间花在一个自己实现的 popcount 函数 Gia_WordCountOnes，它没有被识别并转化为 popcnt 指令，而是用 SSE 向量指令做软件 popcount；
Abc_AigAndLookup(Abc_Aig_t * pMan, Abc_Obj_t * p0, Abc_Obj_t * p1) 来自 src/berkeley-abc/src/base/abc/abcAig.c：7.03%，计算 p0 AND p1，先做特判（如 p0 == p1 时直接返回 p0），若都不命中则走哈希表链表遍历，中间有大量的多级指针访问：pObj->pNtk->vObjs->pArray；
If_ObjPerformMappingAnd(If_Man_t * p, If_Obj_t * pObj, int Mode, int fPreprocess, int fFirst) 来自 src/map/if/ifMap.c：6.72%，依然有不少时间花在 popcount 的软件实现 If_WordCountOnes 上；
Lpk_NodeCutsOneFilter(Lpk_Cut_t * pCuts, int nCuts, Lpk_Cut_t * pCutNew) 来自 src/berkeley-abc/src/opt/lpk/lpkCut.c：5.47%，瓶颈在数据依赖的比较分支上。

运行 208.0B 条指令，其中 50.1B 条 Load 指令，15.4B 条 Store 指令，39.8B 条分支指令，错误预测 534.8M 次，MPKI 等于 534.8M/208.0B*1000=2.57，不低。

6. des¶

再次出现了新的热点函数：

__strcmp_avx2 来自 libc：22.04%，没想到瓶颈居然又出现在了 strcmp 上；
Nm_ManTableLookupId(Nm_Man_t * p, int ObjId) 来自 src/misc/nm/nmTable.c：21.56%，遍历一个哈希表，哈希表的每个桶是个链表，遍历链表中的元素，寻找匹配，主要瓶颈也是这个访问链表和匹配；
Nm_ManTableAdd(Nm_Man_t * p, Nm_Entry_t * pEntry) 来自 src/misc/nm/nmTable.c：12.19%，经典的哈希表插入算法，把新元素插入到对应桶的链表当中，主要瓶颈在判断哈希表中是否已经有相同 key 的元素；
Nm_ManTableLookupName(Nm_Man_t * p, char * pName, int Type) 来自 src/misc/nm/nmTable.c：5.78%，同样是遍历哈希表查询，只不过这次用的是字符串匹配，解释了为啥 strcmp 调用次数那么多，其实是在找哈希表的字符串匹配；
Gia_ManSwiSimulate 来自 src/aig/gia/giaSwitch.c：5.49%，描述见上；
spec_qsort：3.98%，好久不见的熟悉面孔，在 SPEC INT 2017 年代，在 505.mcf_r 中有出色表现（指瓶颈在 qsort 上，且很大一部分开销来自于调用 comparator 函数指针，开 -flto 后因为把函数指针调用内联，性能直接提升 13%）。

这次又回归到经典的哈希表数据结构，且混入了大量字符串匹配，最终瓶颈落在哈希表查询上，然后对链表的访问的空间局部性也很差。

运行 135.7B 条指令，其中有 29.7B 是 Load 指令，11.5B 是 Store 指令，23.3B 是分支指令，错误预测 372.9M 次，MPKI 等于 372.9M/135.7B*1000=2.75，依然不低，从 perf record -e branch-misses:pp 来看，错误预测主要出自 __strcmp_avx2 和 spec_qsort。

小结¶

各负载的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)	MPKI
1. twoexact	GCC 14 `-O3`	6.3	53.2	13.8	3.2	8.4	606.2	11.39
2. beem6	GCC 14 `-O3`	10.1	255.5	57.2	7.3	40.3	192.0	0.75
3. mem	GCC 14 `-O3`	13.5	151.0	43.4	15.4	24.2	1213.7	8.03
4. vga	GCC 14 `-O3`	32.3	490.0	143.9	54.4	76.9	2092.8	4.27
5. mcml	GCC 14 `-O3`	13.6	208.0	50.1	15.4	39.8	534.8	2.57
6. des	GCC 14 `-O3`	17.0	135.7	29.7	11.5	23.3	372.9	2.75

综合以上六个负载，可以看到它触碰了 abc 不同地方的代码，所以热点不尽相同，有 SAT，有看不懂的一些 EDA 相关逻辑，还有带字符串匹配的哈希表查询，其中 SAT 的占比是最大的。由于 SAT 的存在，最终的 MPKI 足足有 3.87，在 SPEC INT 2026 Rate 当中仅次于 723.llvm_r，超过了 721.gcc_r 和 777.zstd_r。

734.vpr_r¶

接下来就到了 EDA 的下一步，逻辑综合后，进行布局（place）布线（route），这就是 vpr_r 干的活。该基准测试分为四个负载：

# 1. jpeg_place vpr stratixiv_arch.timing.xml JPEG_stratixiv_arch_timing.blif --RL_agent_placement off --place_algorithm bounding_box --max_criticality 0.0 --init_t 512 --alpha_t 0.75 --exit_t 1 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file JPEG_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --inner_num 4 --read_initial_place_file ref_JPEG_stratixiv_arch_timing.init.place --place # 2. jpeg_route vpr stratixiv_arch.timing.xml JPEG_stratixiv_arch_timing.blif --place_algorithm bounding_box --place_static_notiming_move_prob 50 25 25 --max_criticality 0.0 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file JPEG_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --place_file ref_JPEG_stratixiv_arch_timing.place --analysis --route # 3. smithwaterman_place vpr stratixiv_arch.timing.xml smithwaterman_stratixiv_arch_timing.blif --RL_agent_placement off --place_algorithm bounding_box --max_criticality 0.0 --init_t 512 --alpha_t 0.75 --exit_t 1 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file smithwaterman_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --inner_num 1.8 --read_initial_place_file ref_smithwaterman_stratixiv_arch_timing.init.place --place # 4. smithwaterman_route vpr stratixiv_arch.timing.xml smithwaterman_stratixiv_arch_timing.blif --place_algorithm bounding_box --place_static_notiming_move_prob 50 25 25 --max_criticality 0.0 --router_initial_timing all_critical --routing_failure_predictor off --route_chan_width 300 --max_router_iterations 20 --router_lookahead classic --initial_pres_fac 1.0 --pres_fac_mult 2.0 --astar_fac 1.5 --router_profiler_astar_fac 1.5 --seed 3 --sdc_file smithwaterman_stratixiv_arch_timing.sdc --pack_verbosity 0 --netlist_verbosity 0 --base_cost_type demand_only --place_file ref_smithwaterman_stratixiv_arch_timing.place --analysis --route

这里涉及的 Stratix IV 是经典的 Altera FPGA，如今已经是时代的眼泪了。四个负载的运行时间分别是 21s、29s、18s 和 19s，总时间 87s，reftime 是 461s，对应 5.3 分。开 -O3 -flto 后，时间降低到 19s、25s、17s 和 17s，总时间 78s，对应 5.9 分，提升显著。如果进一步开到 -O3 -flto -ljemalloc，时间进一步降低到 17s、24s、15s 和 16s，总时间 72s，对应 6.4 分，相比 -O3 提升了 20%。开 -march=native 只能带来不到 1% 的提升。

下面进行具体分析。

1. jpeg_place 和 3. smithwaterman_place¶

因为这两个负载都是做的布局（place），所以就放在一起分析了。它们的热点函数是类似的：

get_non_updateable_bb(ClusterNetId net_id, t_bb* bb_coord_new) 来自 src/vtr-vpr/vpr/src/place/place.cpp：jpeg_place 占比 13.98%，smithwaterman_place 占比 18.26%，遍历 pin，根据它的 x 和 y 坐标，找到 bounding box，即 xmin/xmax/ymin/ymax，主要时间花在读取 x 和 y 上；
try_swap(...) 来自 src/vtr-vpr/vpr/src/place/place.cpp：jpeg_place 占比 12.39%，smithwaterman_place 占比 11.46%，选一个 block 挪到空位置或与另一 block 交换，评估移动后的 cost，如果新的 cost 更优，就接受；
physical_tile_type(ClusterBlockId blk) 来自 src/vtr-vpr/vpr/src/util/vpr_utils.cpp：jpeg_place 占比 7.59%，smithwaterman_place 占比 7.75%，看起来是一些间接索引访存，先读取 block_loc 里的坐标，再从 grid 读取对应坐标的 type，这个函数会在 get_non_updateable_bb 和 get_bb_from_scratch 等地方被频繁调用；
get_bb_from_scratch(ClusterNetId net_id, t_bb* coords, t_bb* num_on_edges) 来自 src/vtr-vpr/vpr/src/place/place.cpp：jpeg_place 占比 6.73%，smithwaterman_place 占比 2.78%，和 get_non_updateable_bb 类似，也是求 bounding box；
malloc/_int_malloc/cfree 来自 libc：jpeg_place 占比 1.62%+1.26%+1.06%=3.94%，smithwaterman_place 占比 1.76%+1.42%+1.11%=4.29%。

开 -O3 -flto 后，能看到的是 physical_tile_type 被内联了进去，节省了频繁调用函数的开销。考虑到这个内存分配和释放的时间占比，-O3 -ljemalloc 提升性能并不意外。

-O3 下，jpeg_place 执行了 273.7B 条指令，其中 Load 有 84.5B 条，Store 有 26.9B 条，分支有 51.9B 条，错误预测 781.0M 次，MPKI 等于 781.0M/273.7B*1000=2.85，不低。smithwaterman_place 执行了 245.0B 条指令，其中 Load 有 76.4B 条，Store 有 24.7B 条，分支有 45.4B 条，错误预测 661.9M 次，MPKI 等于 661.9M/245.0B*1000=2.70。在 bounding box 计算 min/max 过程中，能看到一些 cmov 指令的使用，因此实际上已经少了一些容易预测错误的分支了。在一些没有 cmov 指令的 ISA 下，可能 MPKI 还会更高。

2. jpeg_route 和 4. smithwaterman_route¶

到了布线，热点函数出现了一些不同：

ConnectionRouter<BinaryHeap>::evaluate_timing_driven_node_costs(...) 来自 src/vtr-vpr/vpr/src/route/connection_router.cpp：jpeg_route 占比 9.35%，smithwaterman_route 占比 6.91%，计算 cost，有一些浮点计算；
ConnectionRouter<BinaryHeap>::timing_driven_add_to_heap(...) 来自 src/vtr-vpr/vpr/src/route/connection_router.cpp：jpeg_route 占比 9.34%，smithwaterman_route 占比 6.82%，会调用 evaluate_timing_driven_node_costs 计算 cost，然后插入到 Binary Heap 当中；
ConnectionRouter<BinaryHeap>::timing_driven_expand_neighbours(...) 来自 src/vtr-vpr/vpr/src/route/connection_router.cpp：jpeg_route 占比 8.14%，smithwaterman_route 占比 4.00%，搜索算法中的一步，遍历当前结点的邻居结点，若满足条件则调用 timing_driven_add_to_heap 入堆；
ClassicLookahead::get_expected_delay_and_cong(...) 来自 src/vtr-vpr/vpr/src/route/router_lookahead.cpp：jpeg_route 占比 7.86%，smithwaterman_route 占比 5.14%，计算延迟和拥塞，也有不少浮点计算；
BinaryHeap::get_heap_head() 来自 src/vtr-vpr/vpr/src/route/binary_heap.cpp：jpeg_route 占比 3.14%，smithwaterman_route 占比 1.64%，就是经典的最小二叉堆的实现，获取最小值，用的是浮点数做比较；
malloc/_int_malloc/cfree 来自 libc：jpeg_route 占比 1.10%+1.02%+0.78%=2.90%，smithwaterman_route 占比 1.62%+1.49%+1.08%=4.19%。

虽然不清楚具体算法，但看起来，就像是在做一些 cost 计算，然后通过 BinaryHeap 选择最小的 cost 去做一些扩展，有点类似搜索算法。

开 -O3 -flto 后，能看到的是 evaluate_timing_driven_node_costs 和 timing_driven_add_to_heap 被内联进 timing_driven_expand_neighbours，节省了频繁调用函数的开销，这个函数的时间占比提升到 jpeg_route 的 21.40% 和 smithwaterman_route 的 12.48%，类似的事情应该也发生在 get_expected_delay_and_cong 身上。考虑到这个内存分配和释放的时间占比，-O3 -ljemalloc 提升性能并不意外。

-O3 下，jpeg_route 执行了 424.1B 条指令，其中 Load 有 130.6B，Store 有 50.6B，分支有 79.0B 条，错误预测 1094.2M 次，MPKI 等于 1094.2M/424.1B*1000=2.58，不低。smithwaterman_route 执行了 305.8B 条指令，其中 Load 有 91.0B 条，Store 有 36.0B 条，分支有 59.4B 条，错误预测 609.3M 次，MPKI 等于 609.3M/305.8B*1000=1.99。

小结¶

各负载的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)	MPKI
1. jpeg_place	GCC 14 `-O3`	21	273.7	84.5	26.9	51.9	781.0	2.85
1. jpeg_place	GCC 14 `-O3 -flto`	19	247.0	69.2	22.2	47.8	774.2	3.13
1. jpeg_place	GCC 14 `-O3 -ljemalloc`	19	261.5	81.9	25.1	47.9	764.5	2.92
2. jpeg_route	GCC 14 `-O3`	29	424.1	130.6	50.6	79.0	1094.2	2.58
2. jpeg_route	GCC 14 `-O3 -flto`	26	356.6	103.2	33.5	66.3	1075.5	3.02
2. jpeg_route	GCC 14 `-O3 -ljemalloc`	28	411.5	127.9	48.8	74.9	1080.0	2.62
3. smithwaterman_place	GCC 14 `-O3`	18	245.0	76.4	24.7	45.4	661.9	2.70
3. smithwaterman_place	GCC 14 `-O3 -flto`	17	222.1	63.1	20.8	21.8	662.7	2.98
3. smithwaterman_place	GCC 14 `-O3 -ljemalloc`	17	232.9	73.8	23.0	41.4	648.7	2.78
4. smithwaterman_route	GCC 14 `-O3`	19	305.8	91.0	36.0	59.4	609.3	1.99
4. smithwaterman_route	GCC 14 `-O3 -flto`	17	264.3	72.9	25.5	51.5	590.9	2.24
4. smithwaterman_route	GCC 14 `-O3 -ljemalloc`	18	293.6	88.4	34.2	55.3	594.7	2.03

734.vpr_r 的负载分为两部分，place 和 route，其中 place 主要在做 bounding box 的计算，route 主要在做搜索和优化。开 -flto 和 -ljemalloc 后有明显的性能提升，主要是靠内联了热点函数以及更快的内存分配。整体指令数为 1254B，分支指令数 237B，MPKI 是 2.51，处于中游偏高的水平。

735.gem5_r¶

gem5 是大家很熟悉的模拟器了，在 GEM5 里跑 SPEC CPU 2017 养活了很多博士生，这下终于完成闭环，在 GEM5 里跑 SPEC INT 2026 的 GEM5，自己跑自己。当然，735.gem5_r 的 workload 就不是 SPEC CPU 2026 了，没有继续套娃，而是跑的 RISC-V Linux 内核，以及生成访存序列对内存子系统进行测试。这也是唯一一个看到函数名就知道函数来自哪个文件的项目了，实在太熟悉了。包括如下四个负载：

# 1. o3 gem5sim --stats-file=run_riscv_boot.py_o3_10_--max-ticks_10_000_000_000_stats.stats.txt run_riscv_boot.py o3 10 --max-ticks 10_000_000_000 # 2. timing gem5sim --stats-file=run_riscv_boot.py_timing_4_--max-ticks_20_000_000_000.stats.txt run_riscv_boot.py timing 4 --max-ticks 20_000_000_000 # 3. traffic_21 gem5sim --stats-file=synthetic_traffic.py_LinearGenerator_21.stats.txt synthetic_traffic.py LinearGenerator 21 # 4. traffic_74_ruby gem5sim --stats-file=synthetic_traffic.py_LinearGenerator_74_--ruby.stats.txt synthetic_traffic.py LinearGenerator 74 --ruby

运行时间分别为 16s、21s、21s 和 31s，总时间 89s，reftime 是 487s，对应 5.4 分。各种编译选项的优化效果：

开 -O3 -flto 后运行时间降为 15s、20s、20s 和 29s，共 84s，对应 5.8 分，相比 -O3 提升 6%。对四个负载都有加速效果。
开 -O3 -flto -ljemalloc 后降为 14s、18s、16s 和 26s，共 74s，对应 6.6 分，相比 -O3 提升 20%。对四个负载都有比较显著的加速效果。
开 -O3 -march=native -flto -ljemalloc 后 12s、18s、16s 和 26s，共 72s，对应 6.8 分，相比 -O3 提升 24%。仅对第一个负载有加速效果。

看到这个性能提升的幅度，结合前面的经验，已经可以预估一下后面会见到的瓶颈大概是什么类型了。

1. o3¶

第一个负载是用 O3 CPU 模拟 RISC-V Linux 内核启动，热点函数如下：

malloc/_int_malloc/cfree/_int_free_chunk/operator new 来自 libc/libstdc++：4.78%+3.46%+3.29%+1.35%+1.16%=13.29%，这个比例无敌了，不过确实，Gem5 有大量的动态内存分配，比如各种内存请求，都要 new 一个 Packet 出来；
gem5::TimeBuffer<*>::advance() 来自 src/gem5/cpu/timebuf.hh：3.05%+2.43%+2.39%+2.28%+1.98%=12.13%，用于在各流水线级之间传递数据，维护一个滚动的时间窗口，主要的时间花在了 rep stos 或用 SSE 指令 movups 对内存进行初始化，还有调用构造/析构函数，涉及到一些引用计数的更新；
gem5::o3::IEW::tick() 来自 src/gem5/cpu/o3/iew.cc：3.32%，IEW 代表 Issue Execute Writeback，后端各执行单元的时序在这里模拟，瓶颈主要是 rep stos 指令，用于初始化数据。

其他就是很多零散的函数了，每个函数的耗时都不高。开启 -O3 -flto 后，热点函数变为：

std::_Function_handler<void (), gem5::o3::CPU::CPU(gem5::BaseO3CPUParams const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)：20.80% 实际上是 tickEvent([this]{ tick(); }, "O3CPU tick", false, Event::CPU_Tick_Pri) 当中调用 tick() 的 lambda，就是整个 O3 CPU 各种组件的单步模拟被融合到了一个巨大的函数里，仔细看里面的热点指令，其实还是 gem5::TimeBuffer<*>::advance() 相关的比较多；
gem5::o3::IEW::tick() 来自 src/gem5/cpu/o3/iew.cc：8.58%，描述见上；
malloc/_int_malloc/cfree/_int_free_chunk/operator new 来自 libc/libstdc++：5.55%+3.88%+3.72%+1.45%+1.22%=15.83%，随着其余部分被优化，内存分配的瓶颈更加明显了。

进一步开启 -O3 -flto -ljemalloc 后，内存分配时间减少，热点函数：

std::_Function_handler<void (), gem5::o3::CPU::CPU(gem5::BaseO3CPUParams const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)：23.20%，描述见上；
gem5::o3::IEW::tick() 来自 src/gem5/cpu/o3/iew.cc：9.19%，描述见上；
gem5::o3::Commit::commit() 来自 src/gem5/cpu/o3/commit.cc：4.56%，模拟 CPU 的 Commit 阶段；
malloc/_int_malloc/cfree/_int_free_chunk/operator new/operator delete 来自 libjemalloc：3.12%+1.02%+0.53%=4.67%，明显变少。

开启 -O3 -march=native 带来的效果是，用 memset 调用取代了之前的 rep stos，进而可以用更加高效的 AVX2 版本的 memset 来进行初始化，优化了 gem5::TimeBuffer<*>::advance() 的性能。

-O3 下，执行 211.1B 条指令，其中有 69.9B 条 Load 指令，31.7B 条 Store 指令，43.2B 条分支指令，错误预测 175.5M 次，MPKI 等于 175.5M/211.1B*1000=0.83，比较低。

2. timing¶

第二个负载则是把 O3 换成了 TimingSimpleCPU，相比 O3 模拟的复杂度低很多，此时主要的瓶颈挪到了 RISC-V 架构相关的代码、缓存模拟，以及内存分配上：

cfree/malloc/operator new 来自 libc：5.92%+4.56%+1.55%=12.03%，依然有很多内存分配的瓶颈；
gem5::RiscvISA::Decoder::decode(ExtMachInst mach_inst, Addr addr) 来自 src/gem5/arch/riscv/decoder.cc：8.97%，实现 RISC-V 指令集的 Decode，有很大一部分实现是自动生成的，在 src/gem5/arch/riscv/generated/decode-method.cc.inc 文件里，这里为了加速 Decode，用了一个 decode_cache::InstMap<ExtMachInst>（实际上就是 std::map<ExtMachInst, StaticInstPtr>）来加速，因此大部分的时间其实是在用红黑树实现的缓存中寻找已经 Decode 过的指令编码；
gem5::BaseTags::findBlock(Addr addr, bool is_secure) 来自 src/gem5/mem/cache/tags/base.cc：5.19%，用来实现组相连的 tag 比较，就是一个循环比较 tag 找匹配的算法，主要瓶颈就是 tag 比对；
gem5::PMAChecker::check(const RequestPtr &req) 来自 src/gem5/arch/riscv/pma_checker.cc：4.86%，实现 RISC-V 的 PMA 检查，属于 MMU 的一部分，逻辑很简单，就是循环判断一下请求地址是否属于某个 Uncacheable 地址区间，如果是，就标记 STRICT_ORDER，避免重排；
gem5::RiscvISA::ISA::readMiscReg(RegIndex idx) 来自 src/gem5/arch/riscv/isa.cc：3.34%，用于读取 RISC-V 的 CSR，GCC 这次是用若干 branch 来分别进入不同的 case 处理代码；
gem5::BaseCache::access(PacketPtr pkt, CacheBlk *&blk, Cycles &lat, PacketList &writebacks) 来自 src/gem5/mem/cache/base.cc：2.84%，用于模拟缓存的访问；
gem5::PMP::pmpCheck(const RequestPtr &req, BaseMMU::Mode mode, RiscvISA::PrivilegeMode pmode, ThreadContext *tc, Addr vaddr) 来自 src/gem5/arch/riscv/pmp.cc：2.66%，实现 RISC-V 的 PMP 检查，属于 MMU 的一部分，扫描 PMP 配置，逐个判断是否匹配。

开 -O3 -flto 后，readMiscReg 被内联。开 -O3 -flto -ljemalloc 后，内存分配的开销降低到 4.48%+1.34%=5.82%。-march=native 影响比较小。

-O3 下，执行 333.9B 条指令，其中有 113.9B 条 Load 指令，57.8B 条 Store 指令，69.8B 条分支指令，错误预测 202.9M 次，MPKI 等于 202.9M/333.9B*1000=0.61，比较低。

3. traffic_21¶

热点函数：

cfree/malloc/operator new 来自 libc：6.01%+4.62%+1.44%+1.40%=13.47%，依然有很多内存分配的瓶颈；
gem5::SnoopFilter::lookupRequest(const Packet* cpkt, const ResponsePort& cpu_side_port) 来自 src/gem5/mem/snoop_filter.c：5.93%，在总线上对 Snoop 请求进行 Filter，减少缓存一致性开销；它用一个 std::map 来维护状态，查询和更新耗费了不少时间，是主要的瓶颈；
gem5::AddrRange::removeIntlvBits(Addr a) 来自 src/gem5/base/addr_range.hh：3.39%，针对地址的 interleaving，进行一系列位运算，把 interleaving 的那部分比特去掉，保留其他的，具体实现方法是，找到要去掉的比特的位置，从小到大进行排序，然后把要保留的比特分段插入到结果当中，主要的瓶颈是 src/gem5/base/bitfield.hh 的 ctz64() 函数，GCC 14 会忠实地生成循环，GCC 15 会生成 rep bsfq 指令，如果进一步给 GCC 15 开 -mbmi，会生成 tzcnt 指令，应该会变快一些（Godbolt）；
gem5::BaseTags::findBlock(Addr addr, bool is_secure) 来自 src/gem5/mem/cache/tags/base.cc：3.18%，描述见上。

开启 -O3 -flto 后，热点函数中 removeIntlvBits 消失，时间转移到了 gem5::memory::DRAMInterface::decodePacket 和 gem5::memory::DRAMInterface::chooseNextFRFCFS。开 -O3 -flto -ljemalloc 后，内存分配的开销降低到 4.08%+1.39%=5.47%。-march=native 影响比较小。

-O3 下，执行 226.4B 条指令，其中有 65.5B 条 Load 指令，31.3B 条 Store 指令，50.8B 条分支指令，错误预测 749.3M 次，MPKI 等于 749.3M/226.4B*1000=3.31，明显变高。

4. traffic_74_ruby¶

相比 traffic_21，traffic_74_ruby 开启了 ruby（不是那个 ruby 编程语言），因此瓶颈来到了 gem5::ruby 相关：

cfree/malloc/operator new 来自 libc：4.43%+3.52%+1.29%+0.98%=10.22%，依然有很多内存分配的瓶颈；
gem5::ruby::Cache_Controller::processNextState(Cache_TBE*& m_tbe_ptr, Cache_CacheEntry*& m_cache_entry_ptr, Addr addr) 来自 src/gem5/mem/ruby/protocol/Cache_Controller.cc：4.44%，维护缓存的状态机，还挺复杂的；
gem5::ruby::NetDest::intersectionIsNotEmpty(const NetDest& other_netDest) 来自 src/gem5/mem/ruby/common/NetDest.cc：4.03%，做的是一些 std::bitset 的与操作，这也是主要的瓶颈；
gem5::ruby::MessageBuffer::isReady(Tick current_time) 来自 src/gem5/mem/ruby/network/MessageBuffer.cc：3.94%，维护了消息队列，判断当前时间是否有 ready 的消息；
gem5::ruby::Cache_Controller::getDirEntry(const Addr& param_addr) 来自 src/gem5/mem/ruby/protocol/Cache_Controller.cc：3.80%，根据地址找到 cache 对应的 entry，对 std::map 调用 operator []。

开启 -O3 -flto 后，gem5::ruby::NetDest::intersectionIsNotEmpty 被内联到 gem5::ruby::WeightBased::route 函数里，成为占时间最多的函数，占 6.45%。开启 -O3 -flto -ljemalloc 后，内存分配开销降低到 3.01%+0.83%=3.84%。-march=native 影响比较小。

-O3 下，执行 391.5B 条指令，其中有 103.2B 条 Load 指令，54.4B 条 Store 指令，82.1B 条分支指令，错误预测 1246.0M 次，MPKI 等于 1246.0M/391.5B*1000=3.18，依然较高。

小结¶

各负载的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)	MPKI
1. o3	GCC 14 `-O3`	16	211.1	69.9	31.7	43.2	175.5	0.83
1. o3	GCC 14 `-O3 -ljemalloc`	15	189.5	65.0	28.0	37.0	204.8	1.08
1. o3	GCC 14 `-O3 -flto`	15	193.8	65.0	27.4	39.6	163.5	0.84
2. timing	GCC 14 `-O3`	21	333.9	113.9	57.8	69.8	202.9	0.61
2. timing	GCC 14 `-O3 -ljemalloc`	19	301.8	106.9	51.8	60.5	202.9	0.67
2. timing	GCC 14 `-O3 -flto`	21	324.4	111.6	56.2	67.0	194.7	0.60
3. traffic_21	GCC 14 `-O3`	21	226.4	65.5	31.3	50.8	749.3	3.31
3. traffic_21	GCC 14 `-O3 -ljemalloc`	18	198.0	59.2	26.1	42.7	723.3	3.65
3. traffic_21	GCC 14 `-O3 -flto`	20	216.1	62.8	29.2	48.1	745.4	3.45
4. traffic_74_ruby	GCC 14 `-O3`	31	391.5	103.2	54.4	82.1	1246.0	3.18
4. traffic_74_ruby	GCC 14 `-O3 -ljemalloc`	28	363.6	97.1	49.5	74.1	1200.3	3.30
4. traffic_74_ruby	GCC 14 `-O3 -flto`	29	361.3	96.7	48.6	75.5	1204.0	3.33

735.gem5_r 四个测试跑的是挺不一样的代码路径，第一个 o3 的主要瓶颈就是 O3CPU，第二个 timing 的主要瓶颈是 RISC-V 指令集相关的代码，第三个 traffic_21 主要是缓存和内存控制器，而 traffic_74_ruby 主要是用 ruby 模拟的内存子系统。由于 gem5 高度模块化，有些时候一些可以 inline 函数没有被 inline，所以 -flto 可以带来不错的性能提升。此外，gem5 很喜欢动态分配内存，运行过程中有很多动态产生的对象，比如 Packet 等等，所以用 -ljemalloc 能带来不错的提升。-march=native 确实不太有用武之地。

整体下来，执行 1164B 条指令，其中有 246B 条分支指令，MPKI 等于 2.05，不算高，主要由后两个 traffic 负载贡献。

750.sealcrypto_r¶

sealcrypto 做的是同态加密，只有一个负载做测试：

sealcrypto_r refrate ecuador_province_capitals_refrate.csv Galapagos

运行时间 108s，reftime 是 536s，对应 5.0 分。

很奇特的是，开 -O3 -flto 性能倒退，-O3 -flto -ljemalloc 性能没啥变化，开 -O3 -march=native -flto -ljemalloc 性能进一步倒退。但是，LLVM 22 异军突起，以接近两倍的性能超越了 GCC 和 LLVM 的其他版本，仅用 50.5s 跑完，对应 10.6 分。可以说，完全就靠 750.sealcrypto_r，才让 LLVM 22 在 SPEC INT 2026 整体性能上超越了 GCC 14。下面就来看看是怎么一回事。

首先还是对 -O3 的 GCC 14 进行热点分析：

seal::util::DWTHandler::transform_to_rev(ValueType *values, int log_n, const RootType *roots, const ScalarType *scalar = nullptr) 来自 src/seal/util/dwthandler.h：25.65%，这里 DWT 是离散小波变换 Discrete Wavelet Transform，上一次看到小波变换还是 Ghost Hunter，没想到在这里又遇到了，具体到指令上，就是一堆 imul/add/shr/shl 的运算指令；
seal::util::DWTHandler::transform_from_rev(ValueType *values, int log_n, const RootType *roots, const ScalarType *scalar = nullptr) 来自 src/seal/util/DWTHandler.h：16.58%，应该是 DWT 的逆过程，计算模式基本一样；
seal::util::multiply_uint64_generic(T operand1, S operand2, unsigned long long *result128) 来自 src/seal/util/uintarith.h：11.60%，实现了 64 位乘以 64 位得到 128 位结果的乘法，也是一堆乘法、加法和位运算；
seal::util::dot_product_mod(const uint64_t *operand1, const uint64_t *operand2, size_t count, const Modulus &modulus) 来自 src/seal/util/uintarithsmallmod.cpp：11.48%，实现的是点乘后取模的操作，调用 multiply_accumulate_uint64 函数进行乘法和累加，最后用 barrett_reduce_128 进行取模；
seal::util::dyadic_product_coeffmod(ConstCoeffIter operand1, ConstCoeffIter operand2, size_t coeff_count, const Modulus &modulus, CoeffIter result) 来自 src/seal/util/polyarithsmallmod.cpp：9.08%，实现的是 element wise 的模乘；
seal::util::BaseConverter::fast_convert_array(ConstRNSIter in, RNSIter out, MemoryPoolHandle pool) 来自 src/seal/util/rns.cpp：5.88%，这里的 RNS 应该是 Residue Number System 的缩写，指令上还是大量的 imul/add 等运算；
seal::util::RNSTool::sm_mrq(ConstRNSIter input, RNSIter destination, MemoryPoolHandle pool) 来自 src/seal/util/rns.cpp：5.40%，不确定在做什么，也是大量的运算。

总而言之，既然是密码学，就会有大量的整数运算，其中有不少的乘法和位运算，在素数域下做各种操作。执行指令数足足有 3113.4B，其中有 385.7B 条 Load 指令，161.3B 条 Store 指令，78.5B 条分支指令，错误预测 450.0M 次，MPKI 只有 450.0M/3113.4B*1000=0.14，全场最低，甚至低于 714.cpython_r，同时 IPC 全场最高，达到了 5.09。从 Top down 分析来看，80.7% 属于 Retiring，13.5% 属于 Backend Bound，说明处理器基本在全速跑指令。

开了 -O3 -march=native 后，确实生成了不少 AVX2 指令，但看下来，生成的指令序列还是挺复杂的，有大量的 vpunpcklqdq/vpunpckhqdq/vpermq/vpblendvb/vperm2i128 等指令，并没有在进行计算，而是在不断地倒腾向量寄存器里数据的位置，见 Godbolt。此时指令数降低到 2757.7B，其中有 370.0B 条 Load 指令，126.7B 条 Store 指令，268.6B 条 256 位整数向量指令（int_vec_retired.256bit 性能计数器），76.1B 条分支指令，错误预测 431.0M 次，MPKI 等于 431.0M/2757.7B*1000=0.16。虽然指令数减少了，但 IPC 降低更多，最后性能反而倒退，实际从 108s 增加到 116s。原来的 -O3 版本虽然每次只处理一个元素，但指令的并行度更高，IPC 弥补了指令数多的劣势。GCC 16 的 -march=native 就好多了，生成的指令少了很多数据重排的指令，基本都是 vpaddq/vpsubq/vpmuludq/vpsllq/vpsrlq 这类计算指令，向量化方法不一样，见 Godbolt。

那么，LLVM 22 做了什么优化呢？执行的指令数直接降低到 1213.6B，其中 Load 指令有 302.8B，Store 指令有 109.2B，分支只有 57.2B，错误预测 1093.9M，MPKI 等于 1093.9M/1213.6B*1000=0.90。以 seal::util::DWTHandler::transform_to_rev 为例，可以看到：seal 为了实现 64 位乘 64 位到 128 位的乘法，它自己实现了这个过程，不仅在 seal::util::multiply_uint64_generic 中有实现，实际上也内联到了 seal::util::DWTHandler::transform_to_rev 当中；GCC 14 忠实地实现了这个算法，因此指令数很多（见 Godbolt）；但其实，AMD64 的 mul 指令本来就是一个 64 位乘 64 位得到 128 位的乘法，所以 LLVM 22 直接识别出这段代码做的事情，然后编译成了 mul 指令（见 Godbolt，甚至如果开了 BMI2 扩展，还有 mulx 指令可以用），而且这种 64 位乘法保留高位的指令在各种 ISA 都挺常见的，比如 ARM64 的 umulh，RISC-V 的 mulhu，LoongArch 的 mulh.du。当然，seal 的源码其实已经考虑了这个问题，在编译器支持的情况下，直接用 __int128 来完成这件事情。类似的事情在 706.stockfish_r 的 1to6_classical 中也出现了。然而，这类依赖编译器行为或具体指令集扩展的代码，由于 SPEC CPU 2026 的编译器中立性，都被去掉了，都会回落到最通用的写法上。此时，就只能依赖编译器去自己识别和优化了。

但这样某种意义上也无法反映真实场景中应用的优化情况，因为很多应用已经实际上和处理器的指令集扩展/编译器扩展共进化，实现的时候，脑子里是默认有这些东西，再去做的调优，甚至会写一些指令集相关的优化，用一些 intrinsics，比如原版 stockfish 就有针对 AVX512/AVX2/SSSE3/NEON_DOTPROD/LASX/LSX 的优化。到最后，就是编译器又实现各种 pass，识别程序里的 fallback generic 代码，再映射回高效的实现。其实类似的事情之前就出现过，网上用来证明编译器很聪明的一个例子，就是说识别 popcount 的循环，直接翻译成 popcnt 指令，然而很多程序直接用 __builtin_popcount 而不会真的去手写，这次只不过是换了个 pattern 罢了。当然，好消息是，C++20 引入了 std::popcount，可以一定程度避免类似的情况发生，只是来得太晚了。

相比之下，Geekbench 对这类指令集扩展的优化就比较持开放态度，愿意针对指令集扩展进行针对性的优化，比如经典引入 AMX/SME 对分数的巨大影响，当然这也让它被人骂 AppleBench，只能说见仁见智了。

与此同时，LLVM 22 明显生成了更多的错误预测，用 perf record -e branch-misses:pp 找了一下问题，有 46.81% 的错误预测都出在 sm_mrq 函数当中，主要问题出在它内联的来自 src/seal/util/uintarithsmallmod.h 的 multiply_uint_mod 函数，它最后有一步，如果结果大于模 p，就要减去 p：SEAL_COND_SELECT(tmp2 >= p, tmp2 - p, tmp2)，学过 Montgomery Multiplication 的话应该很熟悉，因为它只能保证优化后的计算结果与真实结果在模 p 结果下相等，但是范围会更大，最大不会超过两倍的 p，所以需要最后做一个处理，这里是 Barrett Reduction，原理是类似的。这个 SEAL_COND_SELECT 宏是这么定义的，此处 SEAL_AVOID_BRANCHING 没有被定义，实际用的是上面的 ternary operator：

// Conditionally select the former if true and the latter if false // This is a temporary solution that generates constant-time code with all compilers on all platforms. #ifndef SEAL_AVOID_BRANCHING #define SEAL_COND_SELECT(cond, if_true, if_false) (cond ? if_true : if_false) #else #define SEAL_COND_SELECT(cond, if_true, if_false) \  ((if_false) ^ ((~static_cast<uint64_t>(cond) + 1) & ((if_true) ^ (if_false)))) #endif

LLVM 22 使用分支实现上面的逻辑，只有在 tmp2 >= p 的情况下才会进行 tmp2 - p 的计算，否则就是计算 tmp2 - 0，指令序列大概是这样：

# 初始化 rax = 0 mov $0x0,%eax # 比较 tmp2(rcx) 和 p(r10) cmp %r10,%rcx # 如果 p > tmp2，跳转到下面的 label: jb label # rax = r10，即 rax = p mov %r10,%rax label: # 计算 tmp2 - rax sub %rax,%rcx

如此计算确实少了，但是分支预测错误率又很高，除非硬件上做 Short Forward Branch 转 Predication 的逻辑（详见浅谈乱序执行 CPU（三：前端））。GCC 14 是这么实现的：

# tmp2 保存在 rax 寄存器，p 保存在 rdx 寄存器 # rcx = rax，即 rcx = tmp2 mov %rax,%rcx # rcx -= rdx，即 rcx = tmp2 - p sub %rdx,%rcx # 比较 tmp2 和 p cmp %rdx,%rax # 如果 tmp2 >= p，则 rax = rcx = tmp2 - p，否则 rax 保持原来的 tmp2 不变 cmovae %rcx,%rax

GCC 14 通过 cmov 指令避免了大量的错误预测，就是这点差别，造成了 LLVM 22 相比 GCC 14 巨大的 MPKI 差距。如果 LLVM 22 在这里选择用 cmov，那性能还能继续往上提一提。事实上，LLVM 22 确实也能在很多地方用 cmov 代替分支，但为什么在这个具体场景下，最后放弃了这个优化，还需要进一步的研究。

LLVM 22 开 -O3 -march=native 后分支预测有所改善，错误预测从 1093.9M 降到 612.7M（MPKI=0.54）。不过改进不在 sm_mrq 函数（它依然用分支而非 cmov），而是 DWTHandler::transform_from_rev 和 RNSTool::fastbconv_sk。这两个函数同样有 SEAL_COND_SELECT 宏，但此时 cond ? if_true : if_false 被编译成了 vpcmpgtq + vblendvpd，相当于把 cmov 向量化了。标量时 LLVM 22 不愿意用 cmov，为了向量化反而自己给实现了出来。

750.sealcrypto_r 在不同编译器和编译选项下的情况如下：

编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)	MPKI
GCC 14 `-O3`	108	3113.4	385.7	161.3	78.5	450.0	0.14
GCC 14 `-O3 -march=native`	116	2757.7	370.0	126.7	76.1	431.0	0.16
GCC 15 `-O3`	106.4	3071.3	379.1	161.4	80.0	416.1	0.14
GCC 15 `-O3 -march=native`	117.7	2701.9	379.4	130.6	77.6	406.9	0.15
GCC 16 `-O3`	105.9	3020.1	381.1	158.5	80.7	430.3	0.14
GCC 16 `-O3 -march=native`	99.3	2492.3	328.0	123.2	81.8	433.3	0.17
LLVM 22 `-O3`	50.5	1213.6	302.8	109.2	57.2	1093.9	0.90
LLVM 22 `-O3 -march=native`	48.2	1126.0	299.2	108.7	53.4	612.7	0.54

753.ns3_r¶

753.ns3_r 和 710.omnetpp_r 做的事情类似，也是网络中的离散事件模拟器。它包括这些负载：

# 1. mobile ns3_r mobile-scenario --simTimeMinutes=3 --RngSeed=1 --RngRun=1 # 2. tcp ns3_r tcp-pacing --simulationEndTime=500 --useEcn=false --RngSeed=1 --RngRun=1 # 3. lena ns3_r lena-radio-link-failure --numberOfEnbs=2 --interSiteDistance=800 --simTime=200 --RngSeed=1 --RngRun=1 # 4. dctcp ns3_r dctcp-example --enableSwitchEcn=true --flowStartupWindow=0.4 --convergenceTime=0.4 --measurementWindow=0.4 --RngSeed=1 --RngRun=1 # 5. wifi_mixed ns3_r wifi-mixed-network --isUdp=0 --payloadSize=3072 --simulationTime=25 --RngSeed=1 --RngRun=1 # 6. wifi_eht ns3_r wifi-eht-network --simulationTime=0.2 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=547 --RngSeed=1 --RngRun=1

六个负载的耗时分别为 18s、15s、3s、19s、23s 和 14s，一共 92s，reftime 是 613s，对应 6.7 分。各编译选项对性能影响：

-O3 -flto：时间降到 16s、14s、3s、17s、19s 和 13s，一共 82s，对应 7.5 分，相比 -O3 提升 12% 的性能；
-O3 -flto -ljemalloc：时间进一步降到 14s、12s、3s、13s、18s 和 11s，一共 71s，对应 8.6 分，相比 -O3 -flto 又提升 15% 性能。

都有巨大提升，只有 -march=native 影响很小，仅 0.5%。下面来进行具体的分析。

1. mobile¶

热点分析：

cfree/malloc/_int_malloc/_int_free_chunk/operator new 来自 libc/libstdc++：6.99%+5.66%+4.15%+1.83%+1.81%=20.44%，又是内存分配密集型应用；
ns3::LteMiErrorModel::GetTbDecodificationStats(const SpectrumValue& sinr, const std::vector<int>& map, uint16_t size, uint8_t mcs, HarqProcessInfoList_t miHistory) 来自 src/ns-3.38/src/lte/model/lte-mi-error-model.cc：9.57%，首先是一个循环，带有一些浮点运算，做一些累加和乘加操作，然后是一段二分查找，看起来主要瓶颈是在二分查找上面，此外在函数开头还会调用下面的 Mib 函数；
ns3::LteMiErrorModel::Mib(const SpectrumValue& sinr, const std::vector<int>& map, uint8_t mcs) 来自 src/ns-3.38/src/lte/model/lte-mi-error-model.cc：4.39%，又是一些浮点运算，不知道在算什么，还会调用 ns3::SpectrumValue::operator[]，做一些浮点比较；
ns3::LteMiErrorModel::MappingMiBler(double mib, uint8_t ecrId, uint16_t cbSize) 来自 src/ns-3.38/src/lte/model/lte-mi-error-model.cc：3.53%，主要的开销是浮点运算、调用 erf 函数和做一些查表，__erf 函数占了总时间的 1.63%；
ns3::MapScheduler::Insert(const Event& ev) 来自 src/ns-3.38/src/core/model/map-scheduler.cc：2.66%，主要瓶颈在对 std::map 红黑树的插入。

首先能看到的是，又是一个内存分配密集型应用。开了 -O3 -flto 后，GetTbDecodificationStats 把 Mib 内联了进去，时间占比提升到 12.68%，但还是内存分配占了最多的时间：7.82%+6.22%+4.51%+1.90%=20.45%。进一步开 -O3 -flto -ljemalloc，内存分配的时间占比终于降低到 6.23%+1.78%=8.01%，其实还是挺高的。

比较少见的是，作为 SPEC INT 2026 Rate 的一员，mobile 涉及不少浮点运算，还包括一些对 libm 的调用，比如 erf/atan2/pow/log，但实际瓶颈又是内存分配，算是半步踏入了 SPEC FP 2026 的领域，但又因为大量 libc 调用而被拉了回来。

-O3 下，执行指令 257.2B，其中 Load 指令有 66.6B，Store 指令有 35.4B，分支指令有 54.4B，错误预测 631.1M，MPKI 等于 631.1M/257.2B*1000=2.45，并不低。从 perf record -e branch-misses:pp 来看，主要的错误预测来自于内存分配器以及 std::map 红黑树的插入算法。

2. tcp¶

第二个负载测的又是不一样的代码了，这次的热点函数：

cfree/malloc/_int_malloc/_int_free_chunk/operator new 来自 libc/libstdc++：7.02%+5.20%+3.68%+2.29%+1.56%=19.75%，又是内存分配密集型应用；
ns3::TcpTxBuffer::NextSeg(SequenceNumber32* seq, SequenceNumber32* seqHigh, bool isRecovery) 来自 src/ns-3.38/src/internet/model/tcp-tx-buffer.cc：4.35%，是一个 TCP 协议栈实现，这里做的是 RFC 6675 SACK 的部分，想起来之前设计的 TCP 实验，这里主要的瓶颈是循环里对 sequence number 的更新；
ns3::MapScheduler::Insert(const Event& ev) 来自 src/ns-3.38/src/core/model/map-scheduler.cc：4.05%，描述见上；
__do_dyncast/__dynamic_cast 来自 libstdc++：1.80%+1.55%=3.35%。

-O3 下，执行指令 204.8B，其中 Load 指令有 63.5B，Store 指令有 41.4B，分支指令有 45.4B，错误预测 148.1M，MPKI 等于 148.1M/204.8B*1000=0.72，比较低。从 perf record -e branch-misses:pp 来看，主要的错误预测来自于内存分配器以及 std::map 红黑树的插入和删除算法。

3. lena¶

第三个负载测的又是不一样的代码了，这次的热点函数：

cfree/malloc/_int_malloc/_int_free_chunk/operator new 来自 libc/libstdc++：7.78%+6.13%+3.13%+2.08%+1.52%=20.64%，又是内存分配密集型应用；
ns3::MapScheduler::Insert(const Event& ev) 来自 src/ns-3.38/src/core/model/map-scheduler.cc：2.41%，描述见上；
__do_dyncast/__dynamic_cast 来自 libstdc++：1.73%+0.82%=2.55%。

-O3 下，执行指令 46.6B，其中 Load 指令有 14.2B，Store 指令有 9.6B，分支指令有 10.4B，错误预测 53.4M，MPKI 等于 53.4M/46.6B*1000=1.15，不高。从 perf record -e branch-misses:pp 来看，主要的错误预测来自于内存分配器以及 std::map 红黑树的插入和删除算法。

4. dctcp¶

第四个负载测的又是不一样的代码了，这次的热点函数：

cfree/malloc/_int_malloc/_int_free_chunk/operator new 来自 libc/libstdc++：6.30%+5.56%+4.03%+1.53%+1.43%+1.12%=40.61%，又是内存分配密集型应用；
ns3::MapScheduler::Insert(const Event& ev) 来自 src/ns-3.38/src/core/model/map-scheduler.cc：6.94%，描述见上。

-O3 下，执行指令 225.3B，其中 Load 指令有 71.1B，Store 指令 43.9B，分支指令有 52.3B，错误预测 295.8M，MPKI 等于 295.8M/225.3B*1000=1.31，略高一点。从 perf record -e branch-misses:pp 来看，主要的错误预测来自于内存分配器以及 std::map 红黑树的插入和删除算法。

5. wifi_mixed¶

热点函数就不列举了，基本还是内存分配，外加 ns3::TcpTxBuffer::NextSeg。-O3 下，执行指令 291.8B，其中 Load 指令有 88.8B，Store 指令有 52.7B，分支指令有 66.5B，错误预测 201.9M，MPKI 等于 201.9M/291.8B*1000=0.69，不高，错误预测的主要来源除了内存分配器和 std::map，还多了一个 __memcmp_avx2_movbe。

6. wifi_eht¶

热点函数除了内存分配，多了 ns3::InterferenceHelper::AppendEvent 和 ns3::WifiSpectrumValueHelper::GetBandPowerW。-O3 下，执行指令 194.3B，其中 Load 指令有 58.1B，Store 指令有 32.6B，分支指令有 44.0B，错误预测 372.0M，MPKI 等于 372.0M/194.3B*1000=1.91，略高，从 perf record -e branch-misses:pp 来看，错误预测主要来自于 ns3::InterferenceHelper::AppendEvent 内联的 std::map 的查询代码。

小结¶

各负载的情况如下：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)	MPKI
1. mobile	GCC 14 `-O3`	18	257.2	66.6	35.4	54.4	631.1	2.45
2. tcp	GCC 14 `-O3`	15	204.8	63.5	41.4	45.4	148.1	0.72
3. lena	GCC 14 `-O3`	3	46.6	14.2	9.6	10.4	53.4	1.15
4. dctcp	GCC 14 `-O3`	19	225.3	71.1	43.9	52.3	295.8	1.31
5. wifi_mixed	GCC 14 `-O3`	23	291.8	88.8	52.7	66.5	201.9	0.69
6. wifi_eht	GCC 14 `-O3`	14	194.3	58.1	32.6	44.0	372.0	1.91

与 727.cppcheck_r 类似，753.ns3_r 又是一个内存分配器 benchmark，大量时间花在 malloc/free 上了，此外还有不少 std::map 或 libm 的调用。-O3 下，执行指令 1221B，分支指令 273B，MPKI 是 1.39。

777.zstd_r¶

作为 SPEC INT 2026 中唯一一个压缩算法，把 SPEC INT 2017 的 557.xz_r 替换掉了，也能见到压缩算法的变迁。从没有被选中的 770.7z_r 来看，zstd 也是成功杀出重围，被认为是更加重要的压缩算法。它一共包括八个负载，但其实压缩的都是同一个文件，不像 557.xz_r 那样会压缩不同的输入文件，只是在代码里对输入数据做了随机修改：

# 1. b3 zstd -b3 -e3 --verbose -i40 cld.tar # 2. b5 zstd -b5 -e5 --verbose -i25 cld.tar # 3. b7 zstd -b7 -e7 --verbose -i12 cld.tar # 4. b10 zstd -b10 -e10 --verbose -i6 cld.tar # 5. b14 zstd -b14 -e14 --verbose -i4 cld.tar # 6. b16 zstd -b16 -e16 --verbose -i1 cld.tar # 7. b18 zstd -b18 -e18 --verbose -i1 cld.tar # 8. b19 zstd -b19 -e19 --verbose -i1 cld.tar

这里的 -b 代表 compression level 下界，-e 代表 compression level 上界，都相等，其实就是每次只测一种 compression level 的意思。8 个负载的运行时间：11.0s、14.5s、13.0s、11.6s、24.5s、10.9s、20.1s 和 25.5s，一共是 131.2s，reftime 是 644s，对应 4.9 分。

开 -O3 -flto 或 -O3 -ljemalloc 没有什么性能提升，但 -O3 -march=native 提升不错，运行时间降低到 10.5s、13.7s、12.6s、11.4s、23.4s、10.3s、18.6s 和 23.5s，一共是 124.0s，对应 5.2 分，提升 6%。

以第一个负载 b3 为例，热点函数：

ZSTD_compressBlock_doubleFast_noDict_generic 来自 src/zstd-1.5.6/lib/compress/zstd_double_fast.c：56.82%，主要在对数据计算哈希，寻找匹配，进而用于压缩，具体算法没有仔细看，挺复杂的；
ZSTD_decompressBlock_internal.part.0 来自 src/zstd-1.5.6/lib/decompress/zstd_decompress_block.c：16.63%，解压缩的主要逻辑，会调用 ZSTD_decompressSequences，挺复杂的；
ZSTD_encodeSequences 来自 src/zstd-1.5.6/lib/compress/zstd_compress_sequences.c：10.91%，分为 bmi2 和 generic 版本，不出意外 bmi2 版本也被 SPEC 禁用了，只能用 generic 版本，逻辑也挺复杂的，没有仔细看。

-O3 下，b3 执行 181.4B 条指令，其中有 49.9B 条 Load 指令，17.7B 条 Store 指令，19.1B 分支指令，错误预测 543.9M 次，MPKI 等于 543.9M/181.4B*1000=3.00，属于比较高的。从 perf record -e branch-misses:pp 来看，有 78.98% 的错误预测来自 ZSTD_compressBlock_doubleFast_noDict_generic，主要是在一些数据依赖的分支上，比如 if (MEM_read64(matchl0) == MEM_read64(ip))；其余有 14.91% 来自 ZSTD_decompressBlock_internal.part.0，主要是 if (ofBits > 1) 等分支。

第二个负载 b5 的热点函数：

ZSTD_RowFindBestMatch.constprop.0 来自 src/zstd-1.5.6/lib/compress/zstd_lazy.c：67.91%，对数组进行循环，找到匹配最长的一项；
ZSTD_compressBlock_lazy_generic.constprop.0 来自 src/zstd-1.5.6/lib/compress/zstd_lazy.c：9.12%，也是比较复杂的匹配算法；
ZSTD_decompressBlock_internal.part.0 来自 src/zstd-1.5.6/lib/decompress/zstd_decompress_block.c：7.80%，描述见上。

-O3 下，b5 执行 273.6B 条指令，其中有 61.3B 条 Load 指令，35.1B 条 Store 指令，28.4B 分支指令，错误预测 562.4M 次，MPKI 等于 562.4M/273.6B*1000=2.06，属于比较高的。错误的分支预测有 78.92% 来自 ZSTD_RowFindBestMatch.constprop.0。

第五个负载 b14 的热点函数：

ZSTD_DUBT_findBestMatch 来自 src/zstd-1.5.6/lib/compress/zstd_lazy.c：85.74%，也是在循环中做最长匹配；
ZSTD_searchMax.constprop.0 来自 src/zstd-1.5.6/lib/compress/zstd_lazy.c：9.04%，根据 dict mode 派发到不同的实现，实现也挺复杂。

-O3 下，b14 执行 197.6B 条指令，其中有 48.8B 条 Load 指令，16.5B 条 Store 指令，29.1B 分支指令，错误预测 1609.6M 次，MPKI 等于 1609.6M/197.6B*1000=8.15，属于特别高的。错误的分支预测有 94.94% 来自 ZSTD_DUBT_findBestMatch，比如 if (match[matchLength] < ip[matchLength]) 的分支。

第六个负载 b16 的热点函数：

ZSTD_insertBtAndGetAllMatches 来自 src/zstd-1.5.6/lib/compress/zstd_opt.c：38.62%，这里 Bt 代表的是 binary tree 二叉树；
ZSTD_insertBt1 来自 src/zstd-1.5.6/lib/compress/zstd_opt.c：35.15%；
ZSTD_compressBlock_opt_generic.constprop.1 来自 src/zstd-1.5.6/lib/compress/zstd_opt.c：16.50%。

-O3 下，b16 执行 129.1B 条指令，其中有 29.9B 条 Load 指令，11.2B 条 Store 指令，18.0B 条分支指令，错误预测 652.1M 次，MPKI 等于 652.1M/129.1B*1000=5.05，也是属于特别高的。错误的分支预测有 40.69% 来自 ZSTD_insertBtAndGetAllMatches，37.45% 来自 ZSTD_insertBt1，比如 if (match[matchLength] < ip[matchLength]) 的分支。

第三/四个负载 b7/b10 的热点与第二个负载 b5 类似；第七/八个负载 b18/b19 的热点函数和第六个负载 b16 类似，就不重复了。可见 zstd 会根据 compression level 选择不同路径，从而在压缩率和性能之间做出权衡。

那么开 -march=native 以后，发生了什么？能看到的是，由于 BMI 指令的引入，一些位运算的指令数变少了，比如 bzhi 和 tzcnt，还有一些是三操作数且不影响 flags 的运算，如 shrx，有点类似一些 RISC 指令集（如 RISC-V）的对应指令。开 -march=native 前后各负载的情况如下表：

负载	编译器 + 选项	时间 (s)	指令 (B)	Load (B)	Store (B)	分支 (B)	错误预测 (M)	MPKI
1. b3	GCC 14 `-O3`	11.0	181.4	49.9	17.7	19.1	543.9	3.00
1. b3	GCC 14 `-O3 -march=native`	10.5	170.4	49.9	18.3	18.9	543.8	3.19
2. b5	GCC 14 `-O3`	14.5	273.6	61.3	35.1	28.4	562.4	2.06
2. b5	GCC 14 `-O3 -march=native`	14.0	250.5	59.7	35.4	28.3	559.1	2.23
3. b7	GCC 14 `-O3`	13.0	228.5	48.9	25.8	29.8	599.3	2.62
3. b7	GCC 14 `-O3 -march=native`	12.7	207.4	46.6	26.0	29.8	596.7	2.88
4. b10	GCC 14 `-O3`	11.6	207.2	41.5	17.6	32.6	516.3	2.49
4. b10	GCC 14 `-O3 -march=native`	11.5	184.0	37.8	17.8	32.6	569.6	3.10
5. b14	GCC 14 `-O3`	24.5	197.6	48.8	16.5	29.1	1609.6	8.15
5. b14	GCC 14 `-O3 -march=native`	23.7	190.1	46.7	15.9	27.8	1612.5	8.48
6. b16	GCC 14 `-O3`	10.9	129.1	29.9	11.2	18.0	652.1	5.05
6. b16	GCC 14 `-O3 -march=native`	10.2	124.7	30.7	12.0	17.3	646.5	5.18
7. b18	GCC 14 `-O3`	20.1	265.8	57.0	17.0	32.6	987.7	3.72
7. b18	GCC 14 `-O3 -march=native`	18.4	259.2	57.0	17.2	31.4	980.7	3.78
8. b19	GCC 14 `-O3`	25.5	342.0	72.9	19.1	41.8	1060.6	3.10
8. b19	GCC 14 `-O3 -march=native`	23.4	332.8	72.7	19.1	40.1	1050.2	3.16

整体来看，-O3 下 777.zstd_r 执行 1827B 指令，其中 232B 是分支指令，但 MPKI 有 3.58，仅次于 729.abc_r 和 723.llvm_r。

讨论¶

编译器选项对比¶

综合下来，编译选项对 SPEC INT 2026 Rate 的性能影响还是不小的，比如：

-flto 对 707.ntest_r、710.omnetpp_r、714.cpython_r、734.vpr_r、735.gem5_r、753.ns3_r 都有一定的性能提升，当热点分散在多个函数，且很多函数都很小时，开 LTO 能带来一定程度的优化，本质上挽回了因可读性而拆分文件带来的性能开销
-ljemalloc 对 710.omnetpp_r、721.gcc_r、723.llvm_r、727.cppcheck_r、734.vpr_r、735.gem5_r、753.ns3_r 有性能提升，只能说这些软件做了太多的动态内存分配，有一些 benchmark 直接就是内存分配器 benchmark 了，此时替换 glibc 为 jemalloc/mimalloc 都有不错的性能提升，不过最新 glibc 也在改进 malloc 性能，不知道改进得怎样了？
-march=native 对 706.stockfish_r、707.ntest_r、735.gem5_r、777.zstd_r 有不错的提升，一方面是诸如 AVX 等 SIMD 指令（对 ARM64 来说，比如 Apple M2，就是针对 706.stockfish_r nnue 的 USDOT 指令，开 -march=native 直接给 706.stockfish_r 加了 33% 的分数，而如果没有这个指令集扩展，那么 -march=native 对 ARM64 没啥性能影响），另一方面就是一些位运算指令，比如 popcnt 和 BMI 扩展；事实上，现在很多软件在实现的时候，就已经考虑了硬件的加速指令，实际编译的时候，往往会直接用对应的 intrinsics，但 SPEC 禁用了这些 intrinsics，退而使用它的 generic 版本，此时就非常依赖 -march=native，以及需要编译器正确识别并翻译为对应的优化指令

还有一些常用的编译参数，比如 -static、-fomit-frame-pointer、-Ofast、-ffast-math 等等，目前没有做太多测试，以后说不定会加上。

编译器版本对比¶

本测试的主要编译器是 GCC 14.2.0，因为它是 Debian Trixie 的编译器版本。有意思的是，即使在 2026 年，随着编译器版本更新，硬件不变的情况下软件性能还在持续增长。GCC 15 能给 706.stockfish_r 生成更快的 SSE/AVX 指令序列，LLVM 22 能识别出 750.sealcrypto_r 的 64 位乘法模式，这些都是很好的例子。此外 LLVM 默认内联 popcount 的优化实现，而 GCC 会转化为对 libgcc 的 popcount 调用，前者代码体积膨胀，后者有额外的 call 开销，这些都会带来可观的性能差距。这些优化其实很具体，完全可以互相移植。在 SPEC INT 2017 时代，基本是 GCC 性能压制 LLVM，而目前 LLVM 凭借 750.sealcrypto_r 的优化相比 GCC 14 扳回一城，又被 GCC 15/16 反超。随着对 SPEC CPU 2026 的研究深入，未来还会编译出更快的程序。

分支预测¶

SPEC INT 2026 Rate 中 MPKI 较高的有：

723.llvm_r MPKI=5.98
729.abc_r MPKI=3.87
777.zstd_r MPKI=3.58
721.gcc_r MPKI=3.37
734.vpr_r MPKI=2.52
707.ntest_r MPKI=2.27
735.gem5_r MPKI=2.05

作为对比，SPEC INT 2017 Rate 的情况：

505.mcf_r MPKI=14.39
541.leela_r MPKI=12.62
557.xz_r MPKI=5.29
531.deepsjeng_r MPKI=4.40
520.omnetpp_r MPKI=4.33
502.gcc_r MPKI=3.13

SPEC INT 2026 Rate 整体低了不少。当然，这是每个 benchmark 的平均值，个别负载可能更高。但无论如何，终于不用和 505.mcf_r 的 spec_qsort 以及 541.leela_r 的 if(randint(2) == 0) 搏斗了。当然，SPEC INT 2026 Rate 也有很多的 MPKI 是来自于 std::map 的红黑树或者其他数据结构，有很多数据依赖的分支，也未必很好从硬件上优化性能。能看到的是，应用程序开始意识到分支预测，并通过 ternary operator 来提示编译器生成 cmov 指令来避免分支的错误预测。

局限性¶

目前的测试仅限于 Intel i9-14900K P-Core，还需要在 ARM64/RISC-V/LoongArch 上做类似的分析。指令集不同，结论应该也会不一样。此外，目前的分析集中在 perf 统计的热点函数上，还可以做更细粒度的分析，比如统计各类指令的使用比例，以及 POPCNT/BMI/AVX 等指令扩展的使用情况。

本文只跑了 Rate 1（单副本）。多副本下内存带宽和缓存竞争会更激烈，MPKI、IPC 等指标可能会有较大差异。此外，分析集中在指令级和分支预测层面，缺少微架构级的深入分析，例如 L1/L2/LLC 的缓存缺失率、TLB miss 等，这些对处理器设计者来说更直接。功耗数据也未纳入考量，综合能效比还需要用 RAPL 等工具进一步测量。最后，PGO（-fprofile-generate / -fprofile-use）也没有尝试，PGO 或许能带来不错的性能提升。

总结¶

本文深入分析了 SPEC CPU 2026 中 INT Rate 的负载，供编译器和处理器的设计者参考。从编译器的角度来说，可以集 GCC 和 LLVM 之长，进一步提升性能；从处理器的角度来说，针对程序的瓶颈进行优化，也能进一步提高分数。

SPEC CPU 2026 在其他指令集上的编译

Thu, 21 May 2026 00:00:00 +0000

SPEC CPU 2026 在其他指令集上的编译¶

SPEC CPU 2026 官方只附带了 aarch64/ppc64le/riscv64/x86_64 指令集的预编译 tools，如果要在其他指令集上使用，就需要首先编译 tools，过程如下：

cd /mnt && tar xvf install_archives/tools-src.tar wget -O config.guess 'https://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD' wget -O config.sub 'https://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD' cp config.* /mnt/tools/src/make-4.2.1/config/ # build tools mkdir -p /mnt/config cd /mnt && echo 'y' | SKIPTOOLSINTRO=1 FORCE_UNSAFE_CONFIGURE=1 MAKEFLAGS=-j16 ./tools/src/buildtools mkdir -p /mnt/config cd /mnt && . ./shrc && packagetools linux-loong64

例如下面是在 LoongArch 上编译 SPEC CPU 2026 的 Dockerfile，假设 SPEC CPU 2026 已经解压到 /mnt：

RUN cd /mnt && tar xvf install_archives/tools-src.tar RUN wget -O config.guess 'https://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD' RUN wget -O config.sub 'https://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD' RUN cp config.* /mnt/tools/src/make-4.2.1/config/ # build tools RUN mkdir -p /mnt/config RUN cd /mnt && echo 'y' | SKIPTOOLSINTRO=1 FORCE_UNSAFE_CONFIGURE=1 MAKEFLAGS=-j16 ./tools/src/buildtools RUN mkdir -p /mnt/config RUN cd /mnt && . ./shrc && packagetools linux-loong64 RUN /mnt/install.sh -f

参考官方文档：Building the SPEC CPU®2026 Toolset。

记一次循环依赖导致的运维小事故

Thu, 07 May 2026 00:00:00 +0000

记一次循环依赖导致的运维小事故¶

背景¶

每次没有 UPS 或 UPS 容量不够用的倒闸对于运维来说都是一次鸡飞狗跳。这次很不幸，鸡飞狗跳终于轮到了我，还好花了一个半小时还是解决了。在这里做个简单的复盘。

现象与排查¶

首先介绍一下现象：停电恢复之后，服务器开起来，但是无法从互联网连接内网的网关（即主网关）。还好，之前考虑到网关的重要性，做了备份，走内网的备用网关顺利进入了内网，然后发现主网关即使 IP 地址是对的，也连不上互联网。虽然通过 keepalived 的配置，主网关和备用网关会通过虚拟 IP 给内网机器提供一个高可用的默认网关，但由于 keepalived 只是检测了机器是否开机，并没有判断能否正常访问互联网，所以 keepalived 总会选择优先级更高的主网关，导致虚拟 IP 指向主网关，因而内网的机器都连不上互联网，还得继续尝试修复主网关。

主网关运行在 ESXi 的虚拟机里，于是进入 ESXi 管理网页，看看它的网络情况。这个虚拟机的网络用的并不是 ESXi 的普通网络，而是通过 vCSA 配置的基于 DS 的 LACP。看了几个不同的 ESXi，发现问题都集中在 LACP 上。而 ESXi 是没法配置 DS 的，所以就先去看了 vCSA。连上 vCSA 以后一看，所有的 ESXi 都掉线了。原来，之前为了方便配置，vCSA 都是通过域名连接 ESXi 的，而域名就需要有 DNS 服务器解析，然而主网关连不上互联网，也就连不上配置好的互联网的 DNS 服务器，于是 vCSA 无法配置 ESXi，然后 ESXi 的部分虚拟机就会断网，正好主网关又在被断网的虚拟机里面。这就形成了一个循环依赖。

既然找到了问题，那就需要打破循环依赖：把主网关在 keepalived 里的优先级调低，让备用网关上位。结果这时候发现一个小的问题：备用网关的 NAT 忽然不工作了。排查了一下，发现是因为 net.ipv4.ip_forward = 1 写在了 /etc/sysctl.conf 文件里，而 Debian 升级 Trixie 以后，这个文件已经不会被应用了，而要把内容写到 /etc/sysctl.d/*.conf 里面去，通过 /usr/lib/systemd/systemd-sysctl --cat-config 来确认是否持久化成功。由于主网关一直工作得很好，备用网关很久都没有做 NAT 了，导致这个问题一直没有被发现。

修好以后，vCSA 就能找回 ESXi 了，然后通过 vCSA 再重新配置一下 ESXi 的 DS LACP 网卡，然后一切就恢复了。

反思¶

虽然事故解决了，但这个过程中暴露了很多问题：

首先就是循环依赖：ESXi 的 DS LACP 依赖 vCSA，vCSA 依赖 DNS，DNS 依赖主网关，主网关通过 DS LACP 访问互联网。如果没有准备好备用网关，且备用网关恰好没有用 DS LACP 因此不受影响，那么修复起来就更麻烦了。解决循环依赖的办法也很简单，对于一些重要的虚拟机（如网关），它所依赖的功能越简单越好。
备用网关的 NAT 功能在升级 Debian 版本后，因为没有仔细阅读 sysctl 的变化而失效了，升级后缺少对功能的检查。
keepalived 只判断机器是否在线，但没有判断机器是否可以正常连接互联网、承担起网关的职能。
ESXi + vCSA 这类闭源软件，修复起来还是比较痛苦的，很多内部的工作原理并不清楚，可调试性也比较差，以后还是会谨慎选择。

AI 时代的本科 CS 教育随想

Sun, 12 Apr 2026 00:00:00 +0000

AI 时代的本科 CS 教育随想¶

背景¶

前几天参加了系里的关于 AI 时代的 CS 教育的研究生论坛，在论坛上我分享了一些小的思考，也在论坛上得到了许多不同的想法，于是把一些想法记录下来，过一段时间再回来看看，到底 CS 教育应该怎么办。

叠甲¶

本文仅代表本人观点，不代表本系或本校的观点，请勿扩大解读！请不要让我上 AI 三大顶会（机、量、新），谢谢！但欢迎大家参与到这个讨论当中，因为目前谁也不知道未来应该怎么做。

现状¶

为了让读者了解背景，首先要知道前 AI 时代的 CS 教育大概是怎样的：本科的时候先上编程课，教大家各种编程语言，然后逐渐深入到各个领域，课上讲授知识点，课下通过工程训练来夯实，由于计算机是工科，这里面通过不断的工程实践来获取经验，是很重要的一个部分。这一部分学习过程很辛苦，但是确实很有效果，可以说几乎每一位系友都是这么锻炼过来的。

下面这一段，如果你还在读本科，请不要点开，点开了也请忘掉，按照老师的要求去做：

但是，现在 AI 时代来临，很多事情都发生了变化。首先，AI 编程能力很强，大一同学辛辛苦苦学完一年，然后发现自己写的代码还不如 AI 写得好写得快，内心的挫败感和对这种古法编程的学习方法的质疑是无与伦比的。这对课程的教学产生了很大的冲击，因为人很难克制自己的懒惰，面对巨大的诱惑，其实很难静下心去学习这些已经由 AI 掌握的基础课程。论坛上有同学做了个比喻，计算器被发明了以后，人类没有失去心算的能力，因为你为了去用好计算器，还是要知道这些基础知识，从小学起，然后到某一个年级告诉你可以用计算器，然后各种考试还可以出计算器没法解决的题目。但是，AI 的能力边界太大了，它能解决从简单到困难的各种问题，只是有一定的概率解决出来是错的。其次，即使是前几年我们还会觉得，专业核心课的大作业还很难由 AI 完成，似乎还能通过大作业的难度来倒逼大家学习，但在今年也纷纷沦陷，对于学生来说，只要愿意，完全可以自己不写一行代码，纯让 AI 写一个能通过所有测试的作业，自己完全不了解内部是怎么实现的，用很短的时间完成作业。而且还不好去举证，说这一定是 AI 写的。这一点在这次论坛上，不同课程的助教都做了类似的实验，证明了这一点。虽然发这篇博客可能会让一些本科同学看到，然后不好好写大作业，但还是希望更多教育工作者可以看到并参与讨论。如果你是正在上课的同学，就自觉忘记吧。

怎么办¶

那么，应该怎么办呢？在这里阐述一些我的观点。一个大的前提是，肯定不能完全禁止 AI，也不能完全依赖 AI，需要辩证地把 AI 引入到 CS 教育当中。

首先是关于 CS 教育要培养出什么样的人才。之前，我们要培养的一方面是工程师，在长时间的工程实践当中积累经验，通过自己的经验，可以打造出一个很完备的系统，功能完善，可靠安全。但其实细分看来，在系统的搭建当中，其实有偏向于顶层设计的架构师，也有偏向于具体实现的工程师。目前 AI 已经可以很快地针对一个给定的 Plan 去做实现，并且实现得还不错，但是从需求到 Plan 的这一步，其实还需要人类的专家知识，因为实际的需求往往很复杂，会有许多大模型没有学过的假设与背景，这需要架构师脑子里把架构想清楚，知道哪里应该怎么做，然后把一部分的工程实现外包给 AI，自己再保证它的实现质量，确保它忠实地实现了所设计的架构，并且实现的系统是可靠安全的。用 AI 写代码很容易，但是写出来复杂可靠的软硬件系统，依然不是容易的事情。另一方面是科学家，在科研方面，科研的品味（Taste）变得更加重要，因为许多科研，其工程量本来就很小，完全可以由 AI 代劳，那么谁能够找到正确的路径，谁才能更好地与 AI 协作，完成科研。换句话说，以后的每个科研工作者，可能自己都是通讯作者，手下是一堆 AI 博士生在做实验，自己提出研究的思路，由 AI 实现和写作，然后自己来保证整个过程的正确性和学术伦理。无论是哪个方向，重点都从以前的知道某个东西“是什么”，变成了“为什么”，进而能够判断“对不对”。论坛上有同学总结得好，人类会更多地变成一个鉴别器（Discriminator）。

那么，具体到课程上，应该怎么做呢？其实我也没有想太明白，需要在未来几年里通过实践来不断修正。目前的一些初步的想法主要有下面这几点：

首先，作业已经不再能区分同学，不能代表同学对知识的掌握情况，只能代表 AI 对知识的掌握情况。所以作业已经完全沦为 AI 的课后送分小练习，在目前这个卷绩点的氛围下，让大家都开开心心地拿作业满分，也是越来越普遍了。如果真的想要通过作业来督促同学进行学习，那就必须回归作为人类的基本功，就是通过更多的线下的口语、展示和对话，以最“人味”的方式对抗 AI 的“机味”。事实上，在目前这个时代，其实如何扩大自己的影响力，也是很重要的技能，真的是酒香也怕巷子深，如何能够让大家看到你，抓住大家的注意力（Attention），很多时候会比你做出来的东西有多好更重要。这些能力，其实是值得通过作业的设计来培养的。我在本科的时候，尝试选了一次演讲的课程，当时看到作业要求，人直接麻了：需要每个人在班级所有人面前做演讲，这对于当时还比较社恐的我，由于太过害怕直接退课了。现在想想，其实都是小意思，当你迈出那一步以后，会发现懂得大大方方展示自己，真的是很重要的能力，是 AI 暂时还无法取代的能力。

既然作业沦陷了，那么，怎么打分呢？难道让每个人都能拿到满绩点？几年前，我在和大一新生聊天，他们就对这个打分的事情感到困惑，因为在目前这个绩点膨胀的时代里，好像很多课程拿满绩点都是天经地义的事情，如果你这个课不给我满绩点，我就要给你打教评低分。但是，又有不少东西和绩点挂钩，奖学金，保研等等。老师当然可以撒手不管，让所有人满绩点，但这只是让竞争延后、转化为其他领域了而已，不比绩点，那就比谁更能在本科的时候做科研，打比赛等等。另一方面，打分也是一个很重要的督促学习的手段，还是一样的前提，人类是很难抵抗自己的懒惰的，如果不是为了毕业以后有更好的发展，可能会有很多人放弃毕业、放弃学习。以前，为了能够顺利毕业，还会咬咬牙做一些比较困难的学习，甚至可能是自己不喜欢的；现在可以用 AI 糊弄了，那就糊弄过去，反正分数不错，能给父母交差，大环境也不好，然后就陷入了虚无主义，一泻千里。所以，似乎考试成为了最后的防线，还能在一定程度上督促学习。

但其实考试也受到了巨大的冲击。第一个问题就是，考试是否允许使用 AI 呢？许多 CS 课程，未来都会或多或少地引入一些 AI，那么学生对 AI 的掌握程度，也是一个需要考核的能力。但目前不同厂商的 AI 的可用性与性能差距过大，“AI 平权”会成为一个新的问题，我们希望比的是谁更会用 AI，而不是谁能用上更好的 AI。就像高考作文要考虑贫富差距一样，本科课程的考试也会面临类似的问题。一种可能性是在考试的时候提供统一的 AI 访问，但目前 AI 生态还是比较混乱，指定一个 AI 让大家用，其实也很容易出现与学生平时使用工具或生态不兼容的问题，而且学校自己部署一个同时几百上千人同时用的 AI 服务，也不是一件容易的事情，希望未来有云厂商可以提供类似的服务，并且能够控制住成本，其实就是一个持续两小时的上百 QPS 的专属推理服务。如果要类比的话，其他一些学科允许使用计算器，出题的时候可以规避，但 AI 能做的事情太杂了，其实很难针对。

另一方面，如果禁止 AI 的话，也有很多问题。首先是没法考察学生的 AI 使用能力，这个在未来会更加重要。其次，学生自己会比较难接受，先给了 AI 这么方便的工具，结果期末考试又要古法做一遍，最后结果可能就是学期中都在用 AI，只有考试前一周突击一下，考完就忘了，当然，好像现在很多人也是这样呢。而且课程很容易被贴上“不与时俱进”的标签，就如那些用十几年前课件的课程一样。现在这个过渡时期，大家都知道会变，但是怎么变并没有达成共识，所以一定会有一个阵痛期。如果你是刚上本科，或者马上要上本科的高中同学，那就要做好成为小白鼠的准备了。此外，随着本地模型的发展，如果让学生带电脑，即使不给联网，有更好的独立显卡的同学，事实上可以通过电脑配置的优势转化为分数，这也会带来新的不公平性。

当然，也不是毫无希望，比如前面说的，加一些有“人味”的考核，唯一的缺点是人力需求较大，难以扩展；或者允许使用 AI，但是必须提交完整的 AI 使用记录，这一点很多地方已经在实践；出题的时候，可能也要想办法去考察学生的思路，一些可以由 AI 完成的作业，不如就直接让学生用 AI 做，变成考察 AI 使用能力的题目。

讨论¶

以上基本是我在论坛上所展示的内容，下面也分享一些我在论坛上了解到的一些情况，以及所引发的思考。

首先，这次论坛不仅有大量的研究生助教参与，也有许多一线的教学老师参与了讨论。其实老师们感受到的冲击也很直接，因为可能就是从 2025-2026 开始，就有一批学生可以完全不接触古法编程，直接上手写代码，用一种完全不同的学习方法来学习各种课程内容。有的人可以很好地利用 AI 加速自己的学习，比如之前需要花费很多时间做的工程实践，现在可以在相同时间内用 AI 做更多的实践，一样可以获得很多甚至更多的实践经验。有的人就完全依赖 AI，可以糙快猛地完成很多事情，但对内部工作一概不知，能做的事情完全取决于 AI 的能力边界，同时自己又缺乏很多基础知识，可以说上知天文下知地理，但是四体不勤五谷不分。现在大家心里没底的就是，AI 的能力是否可以无限扩展，自己只需要站在 AI 的肩膀上，坐等 AI 发火箭上月球就行；还是需要脚踏实地，踩着地月天梯去月球。

咱也不知道答案，就在实践中前行吧。

AI 时代的本科（非 CS）教育随想¶

也顺带聊聊 CS 以外的教育吧，其实它们受到的冲击并不比 CS 少。但从某种意义来说，对于很多学科而言，AI 给每个人都带来了特别强大的工具，而且由于本来也不是学 CS 的，用 AI 能写出以前自己写不出来的代码，一下就把能力范围拓宽了。即便受到 AI 能力的限制，但反正自己也不是干这行的，本来也达不到那个上限，自然也就无所谓了。所以其实在 AI 时代，CS 以外的学科，都很值得学会怎么用 AI，给自己的学科赋能。比如论坛上有同学举了个例子，像写网站这种事，几天之内就能由来自不同学科、可能完全没有基础的同学，各自写出不同的校内交友相亲网站，而且还能让大模型帮忙做运维。好的想法、合适的商机、宣传和包装，这些才是更重要的，不用担心自己做不出来。

SDRAM 在不同访存模式下的带宽分析与实验

Thu, 26 Mar 2026 00:00:00 +0000

SDRAM 在不同访存模式下的带宽分析与实验¶

背景¶

最近在和 @CircuitCoder 交流 SDRAM（通常简写为 DRAM，或更进一步简写为 DDR）的各种性能指标，于是想到利用现有的 DRAMSim3 和 Ramulator2 做一些模拟测试，看看各种访存模式下可以实现峰值带宽的多少比例，再结合时序验证理论与模拟结果是否吻合。实验相关代码已开源至 jiegec/dram-bench。

SDRAM 背景¶

首先简单回顾 SDRAM 的背景，我的知识库中有更详细的介绍，这里仅提炼几个便于理解后续内容的要点，完整的 SDRAM 介绍请移步知识库：

SDRAM 由多级层次组成：
- Channel：对应内存控制器的通道数量，通常每个 Channel 对应 64 位的数据总线
- Rank：每个 Channel 内可能有多个 Rank，这些 Rank 共享总线
- Bank Group：在 DDR4 引入，每个 Rank 有多个 Bank Group
- Bank：每个 Bank Group 有多个 Bank
- Row：每个 Bank 内部同时只有一个 Row 被激活
- Column：激活的 Row 内，每个 Column 对应保存数据的 Cell
如何读写 SDRAM 中的数据：
- 首先根据数据的地址找到对应的 Channel/Rank/Bank Group/Bank/Row/Column，如：
  - Row 地址等于地址的 [33:18] 位，共 65536 个 Row
  - Rank 地址等于地址的 [17:17] 位，共 2 个 Rank
  - Bank 地址等于地址的 [16:15] 位，每个 Bank Group 内有 4 个 Bank
  - Bank Group 地址等于地址的 [14:13] 位，共 4 个 Bank Group
  - Column 地址等于地址的 [12:6] 位，共 1024 个 Column，每 8 个 Column 为一个 Burst
- 通过 Activate 命令激活对应的 Row，如已激活可跳过，如当前激活了其他 Row，则需要先执行 Precharge 命令
- 读写 Row 中保存的数据
SDRAM 中可能的性能瓶颈：
- 在 Row 内连续访问数据很快，但如果要访问的数据位于不同 Row，就需要频繁执行 Activate 和 Precharge
- SDRAM 有周期性的 Refresh，会导致部分时间无法访问数据
- 额外的时序参数，对各类命令的顺序和间隔提出了约束：
  - tCCD：两次 Read 之间的最小间隔
  - tREFI：平均 Refresh 间隔
  - tRFC：Refresh 到下一个 Activate/Refresh 的最小间隔
  - tRTP：同一个 Bank 的 Read 到 Precharge 的最小间隔
  - tRP：同一个 Bank 的 Precharge 到下一个命令的最小间隔
  - tRCD：同一个 Bank 的 Activate 到 Read/Write 的最小间隔
  - tRAS：同一个 Bank 的 Activate 到 Precharge 的最小间隔
- 如何计算峰值带宽：按接口速率（需考虑 DDR）乘以总线位宽可得峰值带宽，但由于上述瓶颈，实际无法达到该值

不同访存模式下的带宽分析与实验结果¶

顺序访存¶

首先考虑最经典的顺序访存，从地址 0 开始，以 64 字节为跨步访问。直觉上顺序访存似乎能实现最大带宽，但实际未必如此。例如以下测试结果中，DDR3 确实接近峰值，而 DDR4 则相差甚远：

模拟 DDR3-1866，带宽达到峰值的 95.6%
模拟 DDR4-3200，带宽达到峰值的 66.4%

DDR3¶

先分析 DDR3-1866 的模拟结果。实验中发出 50000 次 Read，其中 49772 次命中了已激活的 Row，无需额外 Activate 或 Precharge；此外还有 53 次 Refresh，228 次 Activate 和 222 次 Precharge。由于 DDR3-1866 的时序参数中，tCCD（两次 Read 之间的最小间隔）仅为 4 个周期，而一次 Burst 为 8 拍，因为 DDR 在时钟上下边沿都传输数据，所以一次 Read 正好占用数据总线 4 个周期，因此如果所有命令都是 Read，理论上可以完美衔接，不浪费任何带宽。既然实测只有 95% 左右，必定是其他命令引入了空泡：

Activate/Precharge：在顺序访存模式下，当一个 Row 的数据全部被访问后，就要进入下一个 Row，此时需要一次 Precharge 和一次 Activate。一个 Row 内有 2048 个 Column，意味着需要执行 $2048/8=256$ 次 Read 才能遍历完一个 Row，因此 50000 次 Read 对应约 $50000/256=195.3$ 次 Activate/Precharge。此外，Refresh 之前不能有激活的 Row，所以还需要少量额外的 Activate/Precharge 来配合 Refresh。
Refresh：DDR3 SDRAM 要求平均每 tREFI 时间进行一次 Refresh，这里 tREFI 等于 7800 个周期。考虑到有两个 Rank 需要分别 Refresh，因此在 209168 个周期内，需要进行约 $209168\times2/7800=53.6$ 次 Refresh，与实际基本吻合。

尝试理论计算：每 $x$ 次 Read，对应 $x/256$ 次因 Row 结束带来的 Activate/Precharge，每轮 Activate/Precharge 带来 $\mathrm{tRTP}+\mathrm{tRP}+\mathrm{tRCD}$ 的开销；此外在大约 $4x$ 个周期内，每个 Rank 还需进行 $4x/\mathrm{tREFI}$ 次 Refresh，每次 Refresh 带来约 $\mathrm{tRFC}$ 的开销。将这些开销汇总，代入时序参数计算得到约 $0.30x$ 的额外周期数。但实际上，Activate/Precharge 的部分开销可以通过 Bank 级交错来隐藏，比如在访问一个 Bank 的同时，提前对下一个 Bank 执行 Activate/Precharge，因此主要开销来自 Refresh。即使只考虑一个 Rank 内的 Refresh 开销，也有约 $0.17x$ 的额外周期数，此时带宽约为峰值的 $4x/(4x+0.17x)=0.959$ 倍，与实际测得的 95.6% 高度吻合。

DDR4¶

但 DDR4 的带宽比例显著下降，显然出现了新瓶颈。DDR4 相比 DDR3 一个重大改动是，原本一个 Rank 内只有 Bank，现在一个 Rank 包含多个 Bank Group，每个 Bank Group 内又有多个 Bank。这种分层是因为 Bank Group 内部的 tCCD 无法像 DDR3 那样保持在 4 个周期，只能退化为 5-8 个周期，这个新时序参数称为 tCCD_L（L 代表 Long）；而 Bank Group 之间的 tCCD 仍能保持在 4 个周期。这意味着在 DDR4 下，只有交替对不同 Bank Group 发送 Read 命令，才能逼近峰值带宽；一旦局限在某个 Bank Group 内部，每次 Read 需间隔 tCCD_L 个周期，而每次 Read 仅提供 4 个周期的数据，导致巨大的带宽浪费。特别是在 DDR4-3200 速率下，tCCD_L 长达 8 个周期，数据总线有一半时间处于空闲。

为验证这一点，额外做了一个测试：不再单纯顺序访存，而是固定一个 Bank Group，交错读取不同 Bank，每个 Bank 内顺序访问 Row 和 Column，最终测得的带宽仅为峰值的 47.5%，这大致是考虑 Refresh 后数据带宽减半的结果。按前述 DDR3 的分析方法，计算此时 Refresh 的开销：每 $x$ 次 Read，对应 $8x\times\mathrm{tRFC}/\mathrm{tREFI}$ 的周期开销，代入时序参数约为 $0.36x$，性能可达峰值的 $4x/(8x+0.36x)=0.478$ 倍，与实际测试的 47.5% 高度吻合。

再回到顺序访存，为何能实现 66.4% 的峰值带宽？注意刚才假设访存总是映射到同一个 Bank Group，而 66.4% 突破了 47.5% 的极限，意味着必然访问了多个 Bank Group。此时需要深入分析地址映射方式，它采用的 RoChRaBaBgCo 映射方法，意味着从地址高位到低位依次是 Row、Channel、Rank、Bank、Bank Group 和 Column。因此随着地址每次增加 64，当 Column 溢出时就会访问下一个 Bank Group，两个 Bank Group 的 Read 命令可以交错执行，填补流水线空档。如果改变映射顺序，会得到不同结果：

将 Bank Group（Bg）从地址低位挪到高位：
- RoChRaBaCoBg：95.2%
- RoChRaBaBgCo：66.4%
- RoChRaBgBaCo：51.0%
- RoChBgRaBaCo：49.4%
- RoBgChRaBaCo：49.4%
- BgRoChRaBaCo：49.4%
进一步调整 Rank（Ra）的位置：
- BgRoChBaCoRa：76.6%
- BgRoChBaRaCo：57.5%
- BgRoChRaBaCo：49.4%
- BgRaRoChBaCo：47.5%

可见，Bank Group 地址越向高位移动，带宽越低，说明 Bank Group 交错的频率降低，性能随之下降；除了 Bank Group，Rank 之间也可以交错来掩盖部分延迟，但效果不如 Bank Group 交错显著；若两者都置于最高位，则退化为前述 47.5% 的带宽，即数据总线一半时间为空泡，再加上 Refresh 开销。

再回头看 DDR3 的分析：若只考虑 Refresh 带来的性能损耗，理论上限为 95.9% 带宽，实际达到 95.6%；若将 Activate/Precharge 的损耗也计入，理论上限仅为 $4x/(4x+0.30x)=0.930$ 倍峰值，低于 95.6%，这说明在顺序访存模式下，通过地址映射在 Bank 或 Rank 层面实现了交错，从而隐藏了一部分延迟。为此再进行一组实验：仅访问一个 Bank 内的连续 Row 和 Column，测得带宽为峰值的 92.7%，与分析基本吻合。

小结¶

即使是简单的顺序访存，由于地址映射的存在，地址的连续变化会映射到不同的 SDRAM 层次，从而产生不同的性能表现。例如，在 DDR3 上，通过 Bank 和 Rank 的交错，可以隐藏一部分 Activate/Precharge 开销，仅剩 Refresh 开销无法避免；在 DDR4 上，根据地址映射的不同，若能在 Bank Group 层面实现细粒度的交错，就能充分利用更短的 tCCD_S 填满数据总线；否则会产生大量空泡，最坏情况下带宽降至 $4/\mathrm{tCCD_L}$ 的比例。

随机访存¶

与顺序访存相对的另一个极端是随机访存：访问地址随机分布在各种 Bank 和 Row 上，此时 Row 命中率很低，几乎每次 Read 之前都需要 Precharge 和 Activate。在这种场景下，只能依靠 Bank 等层次上的交错来尽量掩盖开销。

DDR3¶

从 DDR3-1866 实验数据可以明显看出随机访存与顺序访存的差异：同样是 50000 次 Read，顺序访存仅有 228 次 Activate 和 222 次 Precharge，而随机访存则达到了 50086 次 Activate 和 50078 次 Precharge。接下来尝试理论分析该场景下的性能。首先，在每个 Bank 内，循环执行 Activate-Read-Precharge，这一组操作至少耗时 $\mathrm{tRAS}+\mathrm{tRP}$；其次，若共有 8 个 Bank（为简化，固定只用一个 Rank），则这 8 个 Bank 可以交错执行 Activate-Read-Precharge 循环，理想情况下在 $\mathrm{tRAS}+\mathrm{tRP}$ 时间内，8 个 Bank 各可完成一次 Read。代入时序参数，推测带宽为峰值的 $4\times8/45=0.71$ 倍，但实际仅测到 46.0%，说明还存在其他瓶颈。事实上，这里需要考虑另一个时序参数 tFAW，其含义是在连续的 tFAW 时间内，最多只能有 4 次 Activate，且该限制跨 Bank 生效。因此即使有 8 个 Bank，实际也只能达到 $4\times4/\mathrm{tFAW}=0.485$ 倍的峰值性能，与模拟值已较为接近，还需考虑 Refresh 开销。在另一组 DDR3-1866 时序参数下，tFAW 为 26 个周期，理论值为 $4\times4/26=0.615$ 倍峰值，模拟结果为 57.7%，同样比较接近。

DDR4¶

DDR4-3200 的情况类似。当 tFAW 为 34 个周期时，理论值为 $4\times4/34=0.471$ 倍峰值，模拟结果为 44.5%。尽管 DDR4-3200 有 4 个 Bank Group，每个 Bank Group 内含 4 个 Bank，总共 16 个 Bank，但在频繁 Activate 的场景下，依然受限于 tFAW。

小结¶

因此，即使是随机访存，只要能将请求分散到不同 Bank 上，性能依然可以接受。当然，随机访存的困境还体现在其他方面：缓存命中率低，且每个缓存行可能只用到少量数据就被丢弃。

对同一个 Bank 的随机访存¶

前面分析提到，Bank 交错可以在一定程度上掩盖 Activate-Precharge 的开销，但如果连这种掩盖也失效了，会发生什么？下面进行一组模拟，固定在某 Bank Group 内的一个 Bank 中，对其内部随机 Row 进行访问。

DDR3¶

仍以 DDR3-1866 时序参数为例进行理论分析：每 $\mathrm{tRAS}+\mathrm{tRP}$ 时间只能完成一次 Read 操作，因此带宽仅为峰值的 $4/(\mathrm{tRAS}+\mathrm{tRP})$ 倍。代入实际时序参数得 $4/(32+13)=0.089$ 倍，模拟结果为 8.5%，与理论分析吻合。

DDR4¶

DDR4-3200 同样如此，代入时序参数得 $4/(52+22)=0.054$ 倍，实际模拟结果为 5.2%，基本吻合。

小结¶

因此，若对同一 Bank 频繁进行随机访存，性能将显著下降。不过，由于地址映射机制的存在，Row 通常位于地址的高位，在实际应用中，绕过 Bank 与 Bank Group 对应的地址位、直接在 Row 地址位上进行随机访问的概率相对较低；然而一旦发生，对性能的影响将是毁灭性的。为此，研究人员提出了一些更为复杂的地址映射模式，例如在选取地址特定位的基础上引入异或运算，或进一步采用 Row Indirection Table，实现从逻辑 Row 到物理 Row 的映射，甚至动态交换特定 Row 中的数据。

讨论¶

上述测试均针对 DDR3 和 DDR4 的读请求展开，那么这些结论对写请求，或者对新一代的 DDR5 会产生怎样的影响呢？

首先，如果将读操作替换为写操作，上述分析基本依然成立：无论是读还是写，占用数据总线的时间相同，虽然时序上略有差异，但瓶颈主要在于 Activate、Precharge、Refresh 等操作，这些方面读和写并无本质区别。模拟结果也证实了这一点，读与写的带宽相差不大。

另一方面，DDR5 相较于 DDR4，主要有两点不同。其一，为支持更高频率，DDR5 将预取（Prefetch）位数从 8n 提升至 16n，即一次突发传输（Burst）包含 16 次传输，对应 8 个时钟周期。同时，为保证每次传输仍为 64 字节，原有的 64 位宽的 Channel 被拆分为两个 32 位宽的 SubChannel。因此，本质上 DDR5 将 Channel 数量翻倍，每个 Channel 内部的数据位宽减为 32 位，突发长度翻倍。这使得一次读写操作将占用数据总线 8 个周期，而不再是先前的 4 个周期，因此上述分析中的相关数值需作相应调整。其二，DDR5 进一步增加了 Bank Group 数量，由 4 组提升至 8 组，从而更容易触发 tCCD_S 而非 tCCD_L。

总结¶

简单总结上述分析，根据访存模式的不同：

顺序访存：DDR3 基本可以打满带宽，DDR4 则取决于地址映射能否在 Bank Group 层面实现细粒度的交错，如果交错频率较低，就会吃到 tCCD_L 的延迟惩罚，打不满带宽
随机访存：借助 Bank 交错，随机访存也能达到约一半的峰值带宽，主要受 tFAW 限制
对同一个 Bank 的随机访存：无法隐藏 Activate-Read-Precharge 延迟，性能最低，受限于 tRAS+tRP

如果读者感兴趣，也可以在代码基础上添加其他访存模式，进一步探索性能表现。

Nginx 反代导致 SSE 延迟变高的问题与解决方法

Thu, 05 Mar 2026 00:00:00 +0000

Nginx 反代导致 SSE 延迟变高的问题与解决方法¶

背景¶

最近有同学遇到这么一个问题：在 Nginx 反代后面搭了一个使用 SSE（Server Sent Events）机制的服务端，但客户端观察到请求延迟比较高，数据批量到达，而不是一行一行地出现。经过排查，发现是 Nginx 的 buffering 机制导致的。本文通过实验复现该问题，并探索了几种解决方法。

问题复现¶

为了复现这个问题，我 Vibe Coding 了一个测试服务端 server.py，监听 8080 端口，在 /events 路径下每秒发送一条 SSE 消息，共发送 5 次：

#!/usr/bin/env python3 """SSE server that sends 5 messages, one every second."""  import time from http.server import HTTPServer, BaseHTTPRequestHandler   class SSEHandler(BaseHTTPRequestHandler):  def do_GET(self):  if self.path == "/events":  self.send_response(200)  self.send_header("Content-Type", "text/event-stream")  self.end_headers()   for i in range(5):  message = f"data: Message {i + 1} at {time.time()}\n\n"  self.wfile.write(message.encode("utf-8"))  self.wfile.flush()  time.sleep(1)   self.wfile.close()  else:  self.send_response(404)  self.end_headers()   def log_message(self, format, *args):  print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {format % args}")   if __name__ == "__main__":  server = HTTPServer(("0.0.0.0", 8080), SSEHandler)  print("SSE server starting on http://0.0.0.0:8080")  server.serve_forever()

启动服务端，使用 curl 访问 localhost:8080/events，可以看到每秒输出一条消息，没有延迟。接下来在 docker compose 里启动 Nginx，配置如下：

services:  nginx:  image: nginx:alpine  ports:  - "80:80"  volumes:  - ./nginx.conf:/etc/nginx/nginx.conf:ro  depends_on:  - sse-server  networks:  - sse-network   sse-server:  image: python:3.11-slim  command: python /app/server.py  volumes:  - ./server.py:/app/server.py:ro  ports:  - "8080:8080"  networks:  - sse-network  networks:  sse-network:  driver: bridge

接着是 nginx 的配置：

events {  worker_connections 1024; }  http {  server {  listen 80;   location /events {  proxy_pass http://sse-server:8080;   # Add additional config later here  }  } }

启动 docker compose，用 curl 分别访问 80 和 8080 端口的 /events，观察到以下现象：

通过 80 端口访问 nginx：5 秒后一次性输出所有 data
通过 8080 端口直接访问 server：每秒输出一条 data

这说明确实是 nginx 导致的。接下来测试几种解决方法。

解决方法¶

首先，查阅 nginx 的文档，可以看到它的描述：

Syntax: proxy_buffering on | off; Default: proxy_buffering on; Context: http, server, location  Enables or disables buffering of responses from the proxied server.  When buffering is enabled, nginx receives a response from the proxied server as soon as possible, saving it into the buffers set by the proxy_buffer_size and proxy_buffers directives. If the whole response does not fit into memory, a part of it can be saved to a temporary file on the disk. Writing to temporary files is controlled by the proxy_max_temp_file_size and proxy_temp_file_write_size directives.  When buffering is disabled, the response is passed to a client synchronously, immediately as it is received. nginx will not try to read the whole response from the proxied server. The maximum size of the data that nginx can receive from the server at a time is set by the proxy_buffer_size directive.  Buffering can also be enabled or disabled by passing “yes” or “no” in the “X-Accel-Buffering” response header field. This capability can be disabled using the proxy_ignore_headers directive.

根据描述，可以想到一些可能的解决方法：

Nginx 配置添加 proxy_buffering off;：工作
服务端在响应的 header 里添加 X-Accel-Buffering: no（self.send_header("X-Accel-Buffering", "no")）：工作

在一开头的场景里，由于中间的 Nginx 配置改起来比较麻烦，最后就用了第二种方法。回想起来，一开始思路走偏了，一直在往 cache 方向想，实际上是 buffering 的问题：Nginx 会先从 server 读取一大片数据，攒够了再发给 client，避免来回转发小段数据的开销，但 SSE 又希望有较低的延迟，这就冲突了。

小结一下：排查这类问题要理解 Nginx 的工作机制，找错方向可能很难定位；同时，利用 LLM 快速构建可复现的测试环境，有助于验证假设。

记一次软 RAID1 坏盘的恢复过程

Wed, 21 Jan 2026 00:00:00 +0000

记一次软 RAID1 坏盘的恢复过程¶

背景¶

最近遇到一个运维场景，两个 SATA 盘组了一个 RAID1，Linux 的根系统也在上面，启动时能进内核，但是内核一直在报错 link is too slow to respond, please be patient 以及 COMRESET failed (errno=-16)。下面记录一下故障排查以及恢复的过程。

恢复过程¶

考虑到 Linux 系统也在 RAID1 上面，所以找了另一台机器，接上两个 SATA 盘，然后观察到，其中一个盘直接无法识别，另一个盘可以正常访问，但它分区表里只有一个分区，参与到了 md 组的 RAID1 当中。遇到盘坏了又是 RAID，第一反应是买一个新盘，然后重建 RAID。但是一通询价，发现最近硬盘价格涨的比较多，所以先尝试如何单盘启动。由于是 UEFI 启动，推测 ESP 在已经坏的那个盘上面，好的盘上并没有 ESP，但它唯一的分区已经占满了整个空间，所以第一步是对 RAID 分区缩容，这就需要：

首先用 fsck -f /dev/md0 && resize2fs /dev/md0 newsize 对根分区进行缩容
用 mdadm --grow --size=newsize /dev/md0 对 RAID 进行缩容
停止 RAID：mdadm --stop /dev/md0
重新分区，缩小 RAID 分区大小：cfdisk /dev/sda
重新启动 RAID，更新 device size：mdadm --assemble --update=devicesize /dev/md0 /dev/sda1

这些步骤完成以后，就可以在空余的空间里建 ESP 分区了：建分区，mkfs.vfat，挂载到 /mnt/boot/efi（假设 /dev/sda1 已经挂载到了 /mnt），接着 arch-chroot /mnt（或者手抄 Archlinux Wiki），进去 grub-install，修改 /etc/fstab，重新 update-grub。

这个过程中，踩了一些小坑，比如：

重启以后直接进 grub shell，没有菜单显示出来，后来发现是 UEFI 启动项里有之前的旧残留，导致 grub 没有能够正确加载 ESP 里面的 grub.cfg，如果在 grub shell 里手动 source 一下是正常的
如果不更新 device size，那么 assemble 的时候会说 does not have a valid v1.2 superblock 报错，实际上就是它记录了旧的分区大小，和新的分区大小不匹配，此时要强制修改它
最后买了个新盘，但是不够大：960GB vs 1TB，导致如果要重组 RAID1 还得再缩小一次已有的 RAID1 分区，之前缩小的时候只给 ESP 预留了足够的空间，但分区还不够小到能够在新盘里建一个相同大小的分区

IBM POWER9 微架构评测

Sat, 17 Jan 2026 00:00:00 +0000

IBM POWER9 微架构评测¶

背景¶

继 IBM POWER8 之后，也来评测一下后续的 IBM POWER9 微架构。IBM POWER9 有 SMT4 和 SMT8 两种版本，我只有 SMT4 版本的测试环境，下列所有评测都是针对 SMT4 版本进行测试。

官方信息¶

IBM 关于 POWER9 微架构有如下公开信息：

下面分各个模块分别记录官方提供的信息，以及实测的结果。官方信息与实测结果一致的数据会加粗。

Benchmark¶

IBM POWER9 的性能测试结果见 SPEC。

前端¶

L1 ICache¶

官方信息：32KB(SMT4)/64KB(split into 2 regions, SMT8)

为了测试 L1 ICache 容量，构造一个具有巨大指令 footprint 的循环，由大量的 nop 和最后的分支指令组成。观察在不同 footprint 大小下的 IPC：

测试环境是 SMT4 Core，所以只有 32KB 的容量。超出 L1 ICache 容量后，IPC 从 6 降低到了 4.7。相比 POWER8，容量不变，超出 ICache 容量后的 IPC 提高了。

测试过程详见测试代码。

取指带宽¶

官方信息：32 bytes/cycle

为了测试实际的 Fetch 宽度，参考如何测量真正的取指带宽（I-fetch width） - JamesAslan 构造了测试。

其原理是当 Fetch 要跨页的时候，由于两个相邻页可能映射到不同的物理地址，如果要支持单周期跨页取指，需要查询两次 ITLB，或者 ITLB 需要把相邻两个页的映射存在一起。这个场景一般比较少，处理器很少会针对这种特殊情况做优化，但也不是没有。经过测试，把循环放在两个页的边界上，发现 IBM POWER9 微架构遇到跨页的取指时确实会拆成两个周期来进行。

在此基础上，构造一个循环，循环的第一条指令放在第一个页的最后四个字节，其余指令放第二个页上，那么每次循环的取指时间，就是一个周期（读取第一个页内的指令）加上第二个页内指令需要 Fetch 的周期数，多的这一个周期就足以把 Fetch 宽度从后端限制中区分开，实验结果如下：

图中蓝线（cross-page）表示的就是上面所述的第一条指令放一个页，其余指令放第二个页的情况，横坐标是第二个页内的指令数，那么一次循环的指令数等于横坐标 +1。纵坐标是运行很多次循环的总 cycle 数除以循环次数，也就是平均每次循环耗费的周期数。可以看到每 8 条指令会多一个周期，因此 IBM POWER9 的前端取指宽度确实是 8 条指令即 32 字节。

为了确认这个瓶颈是由取指造成的，又构造了一组实验，把循环的所有指令都放到一个页中，这个时候 Fetch 不再成为瓶颈（图中 aligned），两个曲线的对比可以明确地得出上述结论。

随着指令数进一步增加，最终瓶颈在每周期执行的 NOP 指令数，因此两条线重合。

测试过程详见测试代码。

L1 ITLB¶

为了测试 L1 ITLB 的容量，构造 b 序列，每个 b 在一个单独的页（64KB 的页大小）中，观察 b 的性能：

可以看到明显的 256 pages 的拐点，对应了 256 entry 的 L1 ITLB。CPI 从 3 升高到了 28。相比 POWER8 的 64-entry L1 ITLB 容量有所提升。

测试过程详见测试代码。

BTB (aka Branch Target Address Calculator, BTAC)¶

官方信息：1 cycle latency

Return Address Stack¶

构造不同深度的调用链，测试每次调用花费的时间，得到如下测试结果：

可以看到 64 的拐点，对应的就是 RAS 的大小。

测试过程详见测试代码。

CBP (Conditional Branch Predictor)¶

官方信息：BHT(3 cycle redirect) + TAGE(4 components, 5 cycle redirect), 256-bit LGHB(long global history vector)

Dispatch¶

官方信息：6 instructions per SMT4, 12 instructions per SMT8

后端¶

ROB (aka ICT)¶

官方信息：256 entries per SMT4 core

把两个独立的 long latency pointer chasing load 放在循环的头和尾，中间用 NOP 填充，当 NOP 填满了 ROB，第二个 pointer chasing load 无法提前执行，导致性能下降。测试结果如下：

拐点在 256 附近。相比 POWER8 的 28*6=168 有所提升

测试过程详见测试代码。

Issue Queue¶

官方信息：54 instructions per SMT4 core, 108 instructions per SMT8 core

L1 DCache¶

官方信息：32KB(SMT4)/64KB(SMT8, split into two regions)

L1 DTLB¶

用类似测 L1 DCache 的方法测试 L1 DTLB 容量，只不过这次 pointer chasing 链的指针分布在不同的 64KB page 上，使得 DTLB 成为瓶颈：

可以看到 256 Page 出现了明显的拐点，对应的就是 256 的 L1 DTLB 容量。没有超出 L1 DTLB 容量前，Load to use latency 是 4 cycle。L1 DTLB 容量相比 POWER8 的 48(ST)/96(SMT) 有所提升，和 POWER8 的 256-entry L2 DTLB 容量相同。

测试过程详见测试代码。

L2 Cache¶

官方信息：8-way 512KB L2 cache

L3 Cache¶

官方信息：20-way 10MB eDRAM L3 cache per core

Prefetcher¶

参考 Battling the Prefetcher: Exploring Coffee Lake (Part 1) 的方式，研究预取器的行为：分配一片内存，把数据从缓存中 flush 掉，再按照特定的访存模式访问，触发预取器，最后测量访问每个缓存行的时间，从而得到预取器预取了哪些缓存行的信息。

首先是连续访问若干个 128B cacheline，观察哪些被预取了进来：

预取的行为相比 POWER8 更加激进：有更多的缓存行被预取到了更近的 L1（或者是 L2？）。

如果是访问了几个分立的缓存行，有时会表现出 Next 3 Line 的行为，但都是到 L3：

测试过程详见测试代码。

总结¶

POWER9 相比 POWER8 是一次较大的架构升级，主要变化：

L1 ITLB 从 64 项大幅增加到 256 项
增加了 BTAC（BTB），分支预测时延降至 1 周期
RAS 从 32 项增加到 64 项
条件分支预测器从 LBHT+GBHT+GSEL 升级为 TAGE
ROB 从 GCT 的 28 Group（约 168 指令）改为 256 项 ICT，粒度从 Group 改为指令
L1 DTLB 从 48(ST)/96(SMT) 项增加到 256 项

虽然升级了不少，但迭代速度已经赶不上竞争对手了。

IBM POWER8 微架构评测

Thu, 15 Jan 2026 00:00:00 +0000

IBM POWER8 微架构评测¶

背景¶

之前评测了很多 AMD64 和 ARM64 指令集的处理器，这次也来评测一下 PPC64LE 指令集的 IBM POWER8 微架构。

官方信息¶

IBM 关于 POWER8 微架构有如下公开信息：

IBM POWER8 processor core microarchitecture

下面分各个模块分别记录官方提供的信息，以及实测的结果。官方信息与实测结果一致的数据会加粗。

Benchmark¶

IBM POWER8 的性能测试结果见 SPEC。

前端¶

L1 ICache¶

官方信息：32 KB, 8-way set associative

为了测试 L1 ICache 容量，构造一个具有巨大指令 footprint 的循环，由大量的 nop 和最后的分支指令组成。观察在不同 footprint 大小下的 IPC：

超出 L1 ICache 容量后，IPC 从 6 降低到了 2.4。其中 6 IPC 来自于，IBM POWER8 在 ST 模式下每周期可以发射 8 条指令，但其中分支指令最多两条，非分支指令最多六条，所以执行 NOP 指令的 IPC 只能达到 6。

测试过程详见测试代码。

L1 ITLB (aka Instruction Effective to Real Address translation Table, IERAT)¶

官方信息：64-entry, fully associative

为了测试 L1 ITLB 的容量，构造 b 序列，每个 b 在一个单独的页（64KB 的页大小）中，观察 b 的性能：

可以看到明显的 64 pages 的拐点，对应了 64 entry 的 L1 ITLB。

测试过程详见测试代码。

BTB (Branch Target Buffer)¶

官方信息：无 BTB，总是通过 3 周期延迟的 Fetch + Decode(Branch Scan) 来得到分支指令的目的地址，靠 SMT 来填补流水线的气泡。

实测也是如此，对于连续执行多个 b 指令的情况，每条 b 指令都需要 3 周期。

Return Address Stack (aka Link Stack)¶

官方信息：32-entry(ST/SMT2)/16-entry(SMT4)/8-entry(SMT8) Link Stack per thread，也就是说总容量是 64，但每个线程只能用一部分

构造不同深度的调用链，测试每次调用花费的时间，得到如下测试结果：

可以看到 32 的拐点，对应的就是 ST 模式下 RAS 的大小。在同一个物理核上的其他三个逻辑核分别运行 stress，就测得 SMT4 模式下的 RAS 大小 16：

类似地，在其余七个逻辑核上分别运行 stress 负载，得到 SMT8 模式下的 RAS 大小为 8：

测试过程详见测试代码。

CBP (Conditional Branch Predictor)¶

官方信息：16K-entry LBHT, 16K-entry GBHT, 16K-entry GSEL，使用 21-bit GHV 记录全局分支历史，GSEL 用来选择由 LBHT 还是 GBHT 提供预测（通过 2-bit 饱和计数器），LBHT 采用 PC 索引，GBHT 和 GSEL 采用 PC+GHV 的哈希索引；此外，还支持把 conditional branch to +8 也就是只跳过一条指令的分支指令改写为 predication

IBP (Indirect Branch Predictor)¶

官方信息：256-entry local count cache, 512-entry global count cache，前者采用 PC 索引，后者采用 PC+GHV 的哈希索引，entry 内容是 30-bit 预测的目的地址加 2-bit 的 confidence（local count cache 的 entry 还有额外的 2-bit 饱和计数器用于选择 local 还是 global）

Dispatch¶

官方信息：按 Group 来 Dispatch，ST 模式下每周期一个 Group，每个 Group 最多 8 条指令（最多 2 条分支，最多 6 条非分支，且第二条分支必须是最后一条指令）；SMT 模式下，每周期从两个线程各 Dispatch 一个 Group，每个 Group 最多 4 条指令（最多 1 条分支，3 条非分支）

后端¶

ROB (aka Global Completion Table, GCT)¶

官方信息：28-entry，ST 模式下每个 entry 对应一个 Group；SMT 模式下每个 entry 对应两个来自同一个线程的 Group；所以最多容纳 28*8=224 条指令；Commit 的粒度是 Group，ST 模式下每周期 Commit 一个 Group，SMT 模式下每周期 Commit 两个 Group

拐点大致在 168 附近，因为每 6 条 NOP 指令对应一个 Group，所以只能容纳 28*6=168 条指令。

测试过程详见测试代码。

Register File¶

官方信息：一共可以有 106 个 Inflight 的 Rename，由 GPR（General Purpose Register）和 VSR（Vector and Scalar Register）共享；GPR 分为两组，每组 124-entry；VSR 分为两组，每组 144-entry；还有额外的两组 SAR（Software Architected Registers），一组用于 GPR，一组用于 VSR；CR（Condition Register）单独 Rename（32-entry mapper）到 64-entry Architected Register File；XER（fiXed-point Exception Register）Rename（30-entry mapper）到 32-entry Architected Register File；LR，CTR 和 TAR 单独 Rename（20-entry mapper）到 24-entry Architected Register File；FPSCR（Floating Point Status and Control Register）单独 Rename 到 28-entry buffer。

Issue Queue¶

官方信息：15-entry Branch Issue Queue，8-entry Condition Register Queue，64-entry UniQueue 用于其他指令；每周期最多 Issue 10 条指令：1x Branch, 1x Condition Register Logical, 2x Fixed Point, 2x Load/Store/Fixed Point to LSU, 2x Load/Fixed Point to LU, 2x Vector-Scalar to VSU/DFU(Decimal Floating point Unit)/Crypto

执行单元¶

官方信息：2 个定点计算流水线（FX），2 个 Load/Store 流水线（LS/FX），2 个 Load 流水线（L/FX），4 个双精度浮点流水线（或 8 个单精度浮点流水线），2 个向量流水线（VMX），1 个密码学流水线（Crypto），1 个分支流水线（Branch），1 个条件寄存器流水线（CR），1 个十进制浮点数流水线，共 16 个；其中 2 个 Load/Store 流水线和 2 个 Load 流水线还能执行简单的定点计算

Load Store Unit¶

官方信息：共有四个 Pipeline，L0/L1 仅 Load，LS0/LS1 可 Load/Store, 3 cycle load-to-use latency

Load/Store (Reorder) Queue¶

官方信息：40-entry（128 Virtual）Store Reorder queue，44-entry（128 Virtual）Load Reorder Queue

Load to use latency¶

官方信息：3-cycle latency

实测在下列的场景下可以达到 3 cycle:

ldr 4, 0(4): load 结果转发到基地址，无偏移
ldr 4, 8(4)：load 结果转发到基地址，有立即数偏移
ldx 4, 4, 6：load 结果转发到基地址，有寄存器偏移
ldx 4, 6, 4：load 结果转发到寄存器偏移

如果访存跨越了 128B 边界，则退化到 16 cycle。

L1 DCache¶

官方信息：64KB, 8-way set associative, 128B cache line, 4 read port, 1 write port，3 cycle load to use latency, store-through（写入会同时写 L1 DCache 和 L2），所以 store miss 不分配 cache line, 16 MSHR(aka Load Miss Queue)

构造不同大小 footprint 的 pointer chasing 链，测试不同 footprint 下每条 load 指令耗费的时间：

可以看到 64KB 出现了明显的拐点，对应的就是 64KB 的 L1 DCache 容量。第二个拐点在 512KB，对应的是 L2 Cache 的容量。第三个拐点是 3MB，对应的是 L1 DTLB 的容量：48*64KB=3MB。

测试过程详见测试代码。

Banking¶

官方信息：L1 DCache 由 16 个 macro 组成，每个 macro 是 16 个 bank，一共是 256 个 bank；sram 用的是 2R 或 1W，所以每个 bank 可以支持每周期 2R 或 1W

L1 DTLB (aka primary Data Effective-to-Real Address Translation, DERAT)¶

官方信息：48-entry(ST)/96-entry(SMT), fully associative

用类似测 L1 DCache 的方法测试 L1 DTLB 容量，只不过这次 pointer chasing 链的指针分布在不同的 64KB page 上，使得 DTLB 成为瓶颈：

可以看到 48 Page 出现了明显的拐点，对应的就是 48 的 L1 DTLB 容量。没有超出 L1 DTLB 容量前，Load to use latency 是 3 cycle。最终出现一个 18.8 cycle 的平台。

测试过程详见测试代码。

L2 DTLB (aka secondary Data Effective-to-Real Address Translation, DERAT)¶

官方信息：256-entry（ST 模式下全可见，SMT 模式下每个线程只有一半可见）, fully associative

继续扩大 DTLB 测试规模，可以看到在 256 处出现了新的拐点，其中 256 的地方出现周期数的骤降，是触发了 Linux 的大页合并功能：

关掉 THP(Transparent Huge Page) 后，周期数的骤降消失，256 的拐点之后周期数增加而不是减少：

测试过程详见测试代码。

L3 TLB¶

官方信息：2048-entry, 4-way set associative, 4 concurrent page table walk

继续扩大 DTLB 测试规模，在 2048 处出现了拐点，注意要关闭 THP，否则拐点会消失，因为实际上没有用到 2048 个页：

测试过程详见测试代码。

Prefetcher¶

官方信息：16-entry Stream Prefetcher，可以跨 4KB/64KB 页边界，用虚拟地址预取，可以预取到 L1/L2/L3

首先是连续访问若干个 128B cacheline，观察哪些被预取了进来：

可以看到后面有 12 个 cacheline 都被预取了，但是预取到了不同的 cache 层次，猜测距离越近的 4 个 cacheline 预取到 L1，更远的 2 个到 L2，其余的 6 个到 L3。

如果是访问了几个分立的缓存行，行为变成了 Next 3 Line：

测试过程详见测试代码。

2025 年我是怎么使用 AI 的

Thu, 25 Dec 2025 00:00:00 +0000

2025 年我是怎么使用 AI 的¶

前言¶

经常看我博客的读者应该能看出来，我研究的主要是计算机系统结构方向，特别是处理器的微架构，几乎没有涉及到 AI 的内容，我也确实不喜欢 AI 研究，仅关注但不参与。但今年，因为各种 AI 技术尤其是 LLM 的发展，我确实成为了很多 AI 技术的用户，可以说 2025 年是我正经大规模用 AI 的元年，所以在年末做一个简单的总结。

我不想在这里给大模型厂商打广告，所以相关的名字我都会按照某 PDF 的方法进行打码，有需要的朋友可以自行查看实际的内容。

Vibe Coding¶

首先的一个冲击来自于 Vibe Coding。我写代码也有大概十五年了，一直都是坚持自己写代码，但今年从一些朋友那里了解到一些 Vibe Coding 的效果以后，也自己尝试了一下，确实能够感受到 Vibe Coding 对写代码的巨大冲击，我的心态也出现了一定的变化。Vibe Coding 并不复杂，其实就是用一些 Coding 客户端，配上 LLM 加一些 Tool Call，使得 LLM 可以自己编写、测试和运行代码。目前随着 LLM 能力的变强，Vibe Coding 逐渐成为了一个可以负担得起且效果不错的东西。结合实际的使用，以及受朋友们的一些启发，我目前已经用它进行了一些 Vibe Coding 尝试，例如：

写一些简单的 MCP 服务器，例如 devdocs-mcp-server 把 devdocs.io 的文档通过 MCP 暴露给 LLM，让它可以精确读取标准库的文档，避免幻觉，还有让 LLM 可以读取波形文件的 waveform-mcp；
写一个 API 路由器 llm-api-router，可以在多个 API 提供商之间自动 Fallback，类似于本地版的 OpenRouter，但在这里主要是为了解决 Rate Limit 问题；
对已有代码的一些改进，比如实现 TODO，修复代码 BUG 等等；
给定提示词，让 LLM 用 Typst 或者 SVG 绘图，相比直接 AI 绘图，我更希望是可编辑的矢量图；
给定一张图，让 LLM 用 Typst 或者 SVG 复刻出来，然后再用 Vision LLM 识别绘制出来的图，观察内容是否和输入足够相似；
对于闭源的软件，让 LLM 自动逆向工程，得到一份关于内部实现的代码，甚至让它实现一份开源的等价实现。

目前给我的感觉是，LLM 借助各种 MCP Tooling，在很多事情上可以做的很好，但也有一些前提条件。第一是 LLM 需要有针对这个事情的知识，但如果它的知识停留在几年前，又做一些比较新的东西（例如 Typst 语法很多 LLM 就不会写），它就比较难写对；第二是，一定要给 LLM 反馈的路径，能够让它自产自纠自查，不然幻觉是很难避免的，一次写对的情况很少，有反馈和无反馈完全是两个表现；第三是，目前 LLM 做复杂事情需要大量的 Token，这就意味着 API 调用时间和开销都是不可忽略的因素，即使我用了比较便宜的 DeepSeek 模型，让 LLM 在后台跑几个小时，价格一样受不了。

举一个数据，我这个月在 DeepSeek 上已经花费了 200 多元，而这个月之前的所有时间加起来，也就不过 10 元。如果相同的 Token 数用在 Claude 上，这个价格不可想象。所以我也终于能理解，那些几百美刀一个月的订阅服务为啥有市场了。也是因为这个原因，我才会降本增效，通过订阅 GLM Coding Plan 去解决一些低频的 Vibe Coding 需求，但它的用量限制和并发限制都比较容易触发，所以才去 Vibe Coding 了一个 API 路由器，对于 GLM Coding Plan 用量以外的需求，再 Fallback 到 DeepSeek 上。

在使用 Vibe Coding 的过程中，我也有一些感受，就感觉我并不是在 Vibe Coding，而是在指挥一个水平飘忽不定的人在写代码。它有时候能精准地找到问题并写出正确的代码，有时候又注意力涣散，必须要我及时地打断它，让它按照我指定的方法去做。对于一些简单的代码，可能可以让它在后台跑，我去做一些别的事情，然后隔一段时间再看看它做得怎么样，有问题了，再提供及时的纠正。然后我就在想，这其实就是当领导吧，给钱让手下的人干活，不一定干的对，所以还得时不时地去纠正一下。某种意义来说，LLM 让每个人都有了成为领导，领导一群 LLM 干活的能力。我目前的工作流就是在 tmux 里挂几个 Qwen Code，连上几个配好的 MCP 服务器以及 API 路由器，然后时不时地看看它做的咋样，做得好就验收，让它 Git Commit，做得不好就让它改，时不时还得翻翻代码看看怎么帮它修。某种意义上，这和课后布置作业，给学生答疑也没啥本质上的区别，甚至 LLM 还更爱说话一点。

既然提到了答疑，也来谈谈教学。这种 Vibe Coding 的能力对于计算机教育的冲击无疑是巨大的，本来很多上课教的内容，AI 可以比较容易地完成，那学生可能就更倾向于让 AI 去完成了，换位思考一下，如果让我在 2025 年成为大一不会编程的新生，我也很难抵御这个诱惑。但是，锻炼代码和工程能力就欠缺了。这就对应一个很重要的问题，就是 AI 它到底是不是一种类似编译器、调试器或者编程语言的工具？我们说学生可以从编程语言而不是汇编学起，是因为它是一个很成熟很可靠的工具，你学会了高层次的工具就是会了，就可以用它做很多事情。AI 就很奇怪，它确实可以做很多事情，但又不总是可以完成，它好像是概率性的图灵完全，全看是否出现幻觉，所以它不是一个可靠的工具，但又是一个好用的工具。那么紧接的问题是，计算机教育，是要教出来真的会写代码的人，还是会用 AI 写代码就行？我目前没有答案，也不知道未来会怎么发展，只能慢慢走一步看一步了。但抛开计算机专业的教育，如果是对于计算机的通识教育，我觉得用 AI 写代码完全没有问题，毕竟对于更多人来说，能解决问题就可以，可不可靠，其实很多时候并不在考虑范围内。

我知道上面这段话可能会让读者有一些焦虑，但我觉得，它都这样了，就共存吧，反正焦虑也没有用，不如拥抱它。至于是否担心自己会被替代，我确实是不担心，目前它还不够专业，就算它再专业，它也没有身份证是吧。希望早日实现生产力极大富足，实现共同富裕，那就不用思考人是不是会被 AI 替代了。另外，高级编程语言出现了，那些写汇编的人去写高级语言，现在 Vibe Coding 来了，只是同一拨人又跑去做 Vibe Coding 罢了。持续学习才是最重要的。今年开始尝试 Vibe Coding 也是让我意识到，随着年龄增大，确实是没有当年对新事物接受得那么快了，这也让我有了一些反思，以后还是要多多接触新技术，一些过去的思维可能也要重新审视。

目前我对 Vibe Coding 的态度是，它不能替代我的思考，相反，我可以更多地思考一些更高层次的东西，而可以适当地把一些细节交给 AI。我也持续在自己写代码，特别是一些关键的部分，我还是无法信任完全由 AI 编写，毕竟它比人还懂得偷懒，经常写出来一些没有测试效果的测例，一看测例都过，一测全是 BUG。

我还会继续尝试和 LLM 协作，尽量保持高质量的代码产出，我认为这是用 Vibe Coding 的底线：用 AI 并不是写出烂代码的理由。以前我们有所谓的中文羞耻，觉得写了很多中文的项目的代码可能不靠谱，现在是所谓的 AI 羞耻，看到 README 里一堆 AI 生成的辞藻就觉得不靠谱一样。我们作为业内人士，还是要把事情做得漂亮，而不是让 AI 生成一个勉强能用的组装拖拉机就完事。

写作和语音输入¶

另一方面 AI 影响比较大的，其实是写作，包括日常的各种文字，比较正式的文档、论文甚至教材，不得不承认，AI 在写作方面确实是比我这种语文是考试弱项的偏科生要做得更好。我通常会自己编写一遍，然后交给 DeepSeek 来润色一遍，再在润色的基础上修改，保证我要表达的意思能够完全地被保留下来。一些小的人情世故，比如微信上和各种人打交道的措辞，网络上发送的邮件或者是 GitHub Issue 等等场合的客套话，AI 确实也是做得比自己好。但是，更完整的内容，或者整体架构上的把握，还是不会让 AI 完全去完成，因为能感觉到 AI 训练所使用的语料和自己的思维方式或者写作的习惯还是不一样的，我还是希望我写的东西能更加得有我的思考和劳动在里面，AI 只是一个让文字看起来更加通顺的工具，帮我纠正一些语法错误之类的。例如，我平时可能更习惯一些口语化的表达，能够让我很快地通过打字或者语音输入把我的脑子里的想法变成文字，然后再让 AI 改写成更加严肃的文字，像教材或者论文，这时 AI 就沦为了纯粹的文字风格改写或者语言翻译器。

既然提到了语音输入法，就不得不提，今年我用语音输入的比例大大提升了。其实语音输入法历史已经很久了，但是以前每次体验，都觉得效果不行，每次输入的有错误还得改，自己改正的时间，还不如自己打字来得快。所以一直以来我都是坚持在所有设备上都是 26 键打字用拼音输入的，当然包括手机，经过多年的练习，确实速度还不错，包括我也不喜欢麻烦别人在微信上听语音，所以我尽量都是用文字的。但今年感受下来，确实是不一样，感觉语音输入的准确率有了质的飞跃，能看到它先识别出一个音对字不对的状态，再纠正成正确的表达，还会提示你，这里可能是另一个词，如果你要修改的话，就直接点一下就行。有这个功能以后，我在手机上真的很多时候就直接用语音输入了，尤其是在一些不太正式的场合，对方也能够对那些少数的识别错误脑补的时候，语音输入确确实实替代了手机上打字。在电脑上，还是打字通常更快一些，但最近也尝试了一下智谱 AutoGLM 的输入法，感觉这种语音输入和 LLM 结合还挺有意思的，就是它们家的语音输入准确率还比不上鸿蒙 6 上的小艺输入法，要是二者的优点能够结合在一起就更好了，相信这一天并不遥远。

小结¶

目前想到的就这么多，其实 AI 还有很多场景可以用到，比如生成图片、视频和音乐等等，目前还没有太多的尝试，相信明年开始会逐渐接触，到时候再在年底写一个 AI 使用总结。总的下来，就是感叹自己也到了感慨科技进步的年纪了，十几年前学技术，虽然也能感觉到科技进步，但因为自己是从零开始，学的就是最新的科技，所以没有啥感觉。但这几年，不断地把新的输入和已有的积累进行对比，就能感觉明显到技术潮流和技术栈的移动，也能感觉到自己对新技术的接受度开始有了略微的下降，这值得让我警醒。以前，我们总是嘲笑大人不追求潮流，不去学习手机等新技术，我们在这个时代长大的人，可也不能犯这样的错误呀。

条件分支预测器逆向工程（以 Apple M1 Firestorm 为例）

Tue, 28 Oct 2025 00:00:00 +0000

条件分支预测器逆向工程（以 Apple M1 Firestorm 为例）¶

背景¶

去年我完成了针对 Apple 和 Qualcomm 条件分支预测器（Conditional Branch Predictor）的逆向工程研究，相关论文已发表在 arXiv 上，并公开了源代码。考虑到许多读者对处理器逆向工程感兴趣，但可能因其复杂性而望而却步，本文将以 Apple M1 Firestorm 为例，详细介绍条件分支预测器的逆向工程方法，作为对原论文的补充说明。

背景知识¶

首先介绍一些背景知识。要逆向工程条件分支预测器，需要先了解其工作原理。条件分支预测器的基本思路是：

条件分支的跳转行为（跳转或不跳转）通常是高度可预测的
预测器的输入包括条件分支的地址，以及近期执行的若干条分支的历史记录；输出则是预测该条件分支是否跳转

为了在硬件上实现这一算法，处理器会维护一个预测表，表中每一项包含一个 2 位饱和计数器，用于预测跳转方向。查表时，系统会对条件分支地址以及近期执行的分支历史进行哈希运算，使用哈希结果作为索引读取表项，然后根据计数器的值来预测分支的跳转方向。

（图源：CMU ECE740 Computer Architecture: Branch Prediction）

目前主流处理器普遍采用 TAGE 预测器，它在上述基础查表方法的基础上进行了重要改进：

观察到不同分支的预测所需的历史长度各不相同：有些分支无需历史信息即可准确预测，有些依赖近期分支的跳转结果，而有些则需要更久远的历史信息；
分支历史越长，可能的路径组合就越多，导致预测器训练过程变慢，训练期间的预测错误率较高，因此希望尽快收敛；
为满足不同分支对历史长度的需求，TAGE 设计了多个预测表，每个表使用不同长度的分支历史。多个表同时进行预测，当多个表都提供预测结果时（仅在 tag 匹配时提供预测），选择使用最长历史长度的预测结果。

（图源：Half&Half: Demystifying Intel's Directional Branch Predictors for Fast, Secure Partitioned Execution）

因此，要逆向工程处理器的条件分支预测器，需要完成以下工作：

确定分支历史的记录方式：通常涉及分支地址和目的地址，通过一系列移位和异或操作，将结果存储在寄存器中；
确定 TAGE 算法的具体实现：包括表的数量、每个表的大小、索引方式以及使用的分支历史长度。

分支历史的逆向¶

第一步是逆向工程处理器记录分支历史的方式。传统教科书方法使用一个寄存器，每当遇到条件分支时记录其跳转方向（跳转记为 1，不跳转记为 0），每个分支占用 1 bit。然而，现代处理器（包括 Intel、Apple、Qualcomm、ARM 和部分 AMD）普遍采用 Path History Register 方法。这种方法设计一个长度为 $n$ 的寄存器 $\mathrm{PHR}$，每当遇到跳转分支（包括条件分支和无条件分支）时，将寄存器左移，然后将当前跳转分支的地址和目的地址通过哈希函数映射，将哈希结果异或到移位寄存器中。用数学公式表示为：

$\mathrm{PHR}_{\mathrm{new}} = (\mathrm{PHR}_{\mathrm{old}} \ll \mathrm{shamt}) \oplus \mathrm{footprint}$

其中 $\mathrm{footprint}$ 是通过分支地址和目的地址计算得到的哈希值。接下来的任务是确定 $\mathrm{PHR}$ 的位宽、每次左移的位数，以及 $\mathrm{footprint}$ 的计算方法。

历史长度¶

首先分析这个更新公式：它将最近的 $\lceil n / \mathrm{shamt} \rceil$ 条跳转分支的信息压缩存储在 $n$ 位的 $\mathrm{PHR}$ 寄存器中。随着移位操作的累积，更早的分支历史信息对 $\mathrm{PHR}$ 的贡献最终会变为零。

第一个实验的目标是确定 $\mathrm{PHR}$ 能够记录多少条最近分支的历史。具体方法是构建一个分支历史序列：

第一个条件分支：以 50% 的概率随机跳转或不跳转；
中间插入若干条无条件分支；
最后一个条件分支：跳转方向与第一个条件分支相同。

接下来分析两种情况：

如果在预测最后一个条件分支时，分支历史 $\mathrm{PHR}$ 仍然包含第一个条件分支的信息，预测器应该能够准确预测最后一个条件分支的方向；
如果中间的无条件分支数量足够多，使得第一个条件分支的跳转信息对预测最后一个条件分支时的 $\mathrm{PHR}$ 没有影响，预测器只能以 50% 的概率进行正确预测。

通过构造上述程序，调整中间无条件分支的数量，并使用性能计数器统计分支预测错误率，可以找到一个临界点。当无条件分支数量超过这个阈值时，第二个条件分支的错误预测率会从 0% 上升到 50%。这个临界点对应 $\mathrm{PHR}$ 能够记录的分支历史数量，即 $\lceil n / \mathrm{shamt} \rceil$。

经过测试：

# 第一列：第二步插入的无条件分支数量加一 # 第二列到第四列：分支预测错误概率的 min/avg/max # 第五列：每次循环的周期数 size,min,avg,max,cycles 97,0.00,0.00,0.01,216.87 98,0.00,0.00,0.01,221.02 99,0.00,0.00,0.01,225.18 100,0.00,0.00,0.01,229.17 101,0.45,0.50,0.53,331.97 102,0.47,0.50,0.54,336.27 103,0.46,0.50,0.54,339.85

测试结果表明阈值为 100：在 Apple M1 Firestorm 上，最多可以记录最近 100 条分支的历史信息。

分支预测错误率是怎么测量的？

处理器内置了性能计数器，会记录分支预测错误次数。在 Linux 上，可以用 perf 子系统来读取；在 macOS 上，可以用 kpep 私有 API 来获取。我开源的代码中对这些 API 进行了封装，可以实现跨平台的性能计数器读取。更进一步，我们还逆向了 Qualcomm Oryon 的针对条件分支的预测错误次数的隐藏性能计数器，用于后续的实验。

分支地址 B 的贡献¶

接下来需要推测 $\mathrm{footprint}$ 的计算方法，即分支地址和目的地址如何参与 $\mathrm{PHR}$ 的更新过程。约定分支地址记为 $B$（Branch 的首字母），目的地址记为 $T$（Target 的首字母），用 $B[i]$ 表示分支地址从低到高第 $i$ 位（下标从 0 开始）的值，$T[i]$ 同理。假设 $\mathrm{footprint}$ 的每一位都由若干个 $B[i]$ 和 $T[i]$ 通过异或运算得到。

分支指令本身占用了多个字节，那么分支地址指的是哪一个字节的地址呢？

经过测试，AMD64 架构下，分支地址用的是分支指令最后一个字节的地址，而 ARM64 架构下，分支地址用的是分支指令第一个字节的地址。这大概是因为 AMD64 架构下分支指令是变长的，并且可以跨越页的边界；ARM64 则是定长的，并且不会跨越页的边界。

设计以下程序来推测某个 $B[i]$ 如何参与 $\mathrm{footprint}$ 的计算：

根据上面的分析，Apple M1 Firestorm 最多可以记录最近 100 条分支的历史信息，为了让 $\mathrm{PHR}$ 进入一个稳定的初始值，执行 100 个无条件分支；
设计两条分支指令，第一条是条件分支，按 50% 的概率跳或不跳；第二条是无条件分支；这两条分支的分支地址只在 $B[i]$ 上不同，其余的位都相同，目的地址相同；
执行若干条无条件分支，目的是把 $B[i]$ 对 $\mathrm{PHR}$ 的贡献向前移动；
执行一条条件分支指令，其跳转方向与第二步中条件分支的方向一致。

对应的代码如下：

// step 1. // 100 jumps forward goto jump_0; jump_0: goto jump_1; // ... jump_98: goto jump_99; jump_99:  // step 2. int d = rand(); // the follow two branches differ in B[i] // first conditional branch, 50% taken or not taken if (d % 2 == 0) goto target; // second unconditional branch else goto target; target:  // step 3. // variable number of jumps forward goto varjump_0; varjump_0: goto varjump_1; // ... varjump_k: goto last;  // step 4. // conditional branch last: if (d % 2 == 0) goto end; end:

第二步中条件分支跳转与否，会影响分支历史中 $B[i]$ 一个位的变化，它会经过哈希函数，影响 $\mathrm{footprint}$，进而异或到 $\mathrm{PHR}$ 中。通过调整第三步执行的无条件分支个数，可以把 $B[i]$ 对 $\mathrm{PHR}$ 的影响左移到不同的位置。如果 $B[i]$ 对 $\mathrm{PHR}$ 造成了影响，就可以正确预测最后一条条件分支指令的方向。当左移次数足够多时，$B[i]$ 对 $\mathrm{PHR}$ 的贡献会变为零，此时对最后一条条件分支指令的方向预测只有 50% 的正确率。在 Apple M1 Firestorm 上测试，得到如下结果：

横坐标 Dummy branches 指的是上面第三步插入的无条件分支的个数，纵坐标 Branch toggle bit 代表修改的是具体哪一个 $B[i]$，颜色对应分支预测的错误率，浅色部分对应最后一条分支只能正确预测 50%，深色部分对应最后一条分支总是可以正确预测。

从这个图可以得到什么信息呢？首先观察 $B[2]$ 对应的这一行，可以看到它确实参与到了 $\mathrm{PHR}$ 的计算中，但是仅仅经过 28 次移位，这个贡献就被移出了 $\mathrm{PHR}$，为了保留在 $\mathrm{PHR}$ 内，最多移动 27 次。类似地，在移出 $\mathrm{PHR}$ 之前，$B[3]$ 最多移动 26 次，$B[4]$ 最多移动 25 次，$B[5]$ 最多移动 24 次。

但实际上，这些 $B$ 是同时进入 $\mathrm{PHR}$ 的：这暗示它们对应 $\mathrm{footprint}$ 的不同位置。如果某个 $B[i]$ 出现在 $\mathrm{footprint}$ 更高位的地方，它也会进入 $\mathrm{PHR}$ 更高位，经过更少的移位次数就会被移出 $\mathrm{PHR}$；反之，如果 $B[i]$ 出现在 $\mathrm{footprint}$ 更低位的地方，它能够在 $\mathrm{PHR}$ 中停留更长的时间。

根据上面的实验，可见 $B[5], B[4], B[3], B[2]$ 参与到了 $\mathrm{footprint}$ 计算中，而 $B$ 的其他位则没有。但比较奇怪的是，$\mathrm{PHR}$ 理应可以记录最近 100 条分支的信息，但实际上只观察到了 28。所以一定还有其他的信息。

目的地址 T 的贡献¶

刚刚测试了 $B$，接下来测试 $T$ 各位对 $\mathrm{PHR}$ 的贡献，方法类似：

为了让 $\mathrm{PHR}$ 进入一个稳定的初始值，执行 100 个无条件分支；
设计一个间接分支，根据随机数，随机跳转到两个不同的目的地址，这两个目的地址只在 $T[i]$ 上不同，其余的位都相同，分支地址相同；
执行若干条无条件分支，目的是把 $T[i]$ 对 $\mathrm{PHR}$ 的贡献向前移动；
执行一条条件分支指令，其跳转方向取决于第二步中间接分支所使用的随机数。

对应的代码如下：

// step 1. // 100 jumps forward goto jump_0; jump_0: goto jump_1; // ... jump_98: goto jump_99; jump_99:  // step 2. int d = rand(); // indirect branch // the follow two targets differ in T[i] auto targets[2] = {target0, target1}; goto targets[d % 2]; target0: // add many nops target1:  // step 3. // variable number of jumps forward goto varjump_0; varjump_0: goto varjump_1; // ... varjump_k: goto last;  // step 4. // conditional branch last: if (d % 2 == 0) goto end; end:

在 Apple M1 Firestorm 上测试，得到如下结果：

为了测试 T[31]，岂不是要插入很多个 NOP，一方面二进制很大，其次还要执行很长时间？

是的，所以这里在测试的时候，采用的是类似 JIT 的方法，通过 mmap MAP_FIXED 在内存中特定位置分配并写入代码，避免了用汇编器生成一个巨大的 ELF。同时，为了避免执行大量的 NOP，考虑到前面已经发现 $B[6]$ 或更高的位没有参与到 $\mathrm{PHR}$ 计算中，所以可以添加额外的一组无条件分支来跳过大量的 NOP，它们的目的地址相同，分支地址低位相同，因此对 PHR 不会产生影响。对应的代码如下：

// step 1. // 100 jumps forward goto jump_0; jump_0: goto jump_1; // ... jump_98: goto jump_99; jump_99:  // step 2. int d = rand(); // indirect branch // the follow two targets differ in T[i] auto targets[2] = {target0, target1}; goto targets[d % 2]; target0: // skip over nops, while keeping B[5:2]=0 goto target2; // add many nops target1: goto target2;  target2:  // step 3. // variable number of jumps forward goto varjump_0; varjump_0: goto varjump_1; // ... varjump_k: goto last;  // step 4. // conditional branch last: if (d % 2 == 0) goto end; end:

由此我们终于找到了分支历史最长记录 100 条分支的来源：$T[2]$ 会经过 $\mathrm{footprint}$ 被异或到 $\mathrm{PHR}$ 的最低位，然后每次执行一个跳转分支左移一次，直到移动 100 次才被移出 $\mathrm{PHR}$。类似地，$T[3]$ 只需要 99 次就能移出 $\mathrm{PHR}$，说明 $T[3]$ 被异或到了 $\mathrm{PHR}[1]$。依此类推，可以知道涉及 $T$ 的 $\mathrm{footprint} = T[31:2]$，其中 $T[31:2]$ 代表一个 30 位的数，每一位从高到低分别对应 $T[31], T[30], \cdots, T[2]$。

小结¶

那么问题来了，前面测试 $B$ 的时候，移位次数那么少，明显少于 $T$ 的移位次数。这有两种可能：

硬件上只有一个 $\mathrm{PHR}$ 寄存器，$T[31:2]$ 被异或到 $\mathrm{PHR}$ 的低位，而 $B[5:2]$ 被异或到 $\mathrm{PHR}$ 的中间位置；
硬件上有两个 $\mathrm{PHR}$ 寄存器，其中一个是 100 位，它的 $\mathrm{footprint} = T[31:2]$，记为 $\mathrm{PHRT}$；另一个是 28 位，它的 $\mathrm{footprint} = B[5:2]$，记为 $\mathrm{PHRB}$。

经过后续的测试，基本确认硬件实现的是第二种。用数学公式表达：

$\mathrm{PHRT}_{\mathrm{new}} = (\mathrm{PHRT}_{\mathrm{old}} \ll 1) \oplus \mathrm{T}[31:2]$

$\mathrm{PHRB}_{\mathrm{new}} = (\mathrm{PHRB}_{\mathrm{old}} \ll 1) \oplus \mathrm{B}[5:2]$

有意思的是，在我的论文发表后不久，Apple 公开的专利 Managing table accesses for tagged geometric length (TAGE) load value prediction 中就出现了相关表述，证明了逆向结果的正确性。

按照这个方法，我还逆向工程了 Apple、Qualcomm、ARM 和 Intel 的多代处理器的分支历史记录方法，并进行了公开，供感兴趣的读者阅读，也欢迎读者将测试代码移植到更多处理器上，并贡献逆向工程的结果。

TAGE 表的逆向¶

接下来，我们将目光转向 TAGE 表的逆向工程。TAGE 表与缓存结构类似，也是一个多路组相连的结构，通过 index 访问若干路，然后对每一路进行 tag 匹配，匹配正确的那一路提供预测。TAGE 在预测时，输入是历史寄存器，即上面逆向得到的 $\mathrm{PHRT}$ 和 $\mathrm{PHRB}$，以及分支地址，目前这两个输入都是可控的。为了避免多个表同时提供预测，首先逆向工程使用分支历史最长的表的参数：它的容量是多少，index 如何计算，tag 如何计算，以及几路组相连。

如何确保使用分支历史最长的表提供预测呢？其实还是利用分支历史的特性，将随机数注入到 $PHRT$ 中，例如前面的间接分支，让两个目的地址只在 $T[2]$ 上不同：

// add some unconditional jumps to reset phr to some constant value // 100 jumps forward goto jump_0; jump_0: goto jump_1; // ... jump_98: goto jump_99; jump_99:  // inject int d = rand(); // indirect branch // the follow two targets differ in T[2] auto targets[2] = {target0, target1}; goto targets[d % 2]; target0: // add nop here target1:  // add some unconditional jumps to shift the injected bit left goto varjump_0; varjump_0: goto varjump_1; // ... varjump_k: goto last; last:

根据前面的分析，$T[2]$ 会被异或到 $\mathrm{PHRT}$ 的最低位上，每执行一次无条件分支，就左移一位。因此，通过若干个无条件分支，可以把 d % 2 这个随机数注入到 $\mathrm{PHRT}$ 的任意一位上。之后我们还会很多次地进行这种随机数的注入。

把随机数注入到 $\mathrm{PHRT}$ 高位以后，再预测一个根据随机数跳转或不跳转的分支，就可以保证它只能由使用分支历史最长的表来进行预测。

逆向工程 PC 输入¶

首先，我们希望推断 PC 如何参与到 index 或 tag 计算中。通常，TAGE 只会采用一部分 PC 位参与 index 或 tag 计算。换句话说，如果两个分支在 PC 上不同的部分没有参与 index 或 tag 计算，那么 TAGE 无法区分这两条分支。如果这两个分支跳转方向相反，并且用相同的 PHR 进行预测，那么一定会出现错误的预测。思路如下：

用 100 个无条件分支，保证 PHR 变成一个确定的值；
注入随机数 d % 2 到 PHRT，并移动到高位（例如 $PHRT[99]$），使用前面所述的方法；
执行两个条件分支，它们在分支地址上只有一位 $PC[i]$ 不同，它们的跳转条件相反，当第一个条件分支不跳转的时候，会执行第二个条件分支，它总是会跳转。

对应代码类似于：

// step 1. inject phrt int d = rand(); inject_phrt(d % 2, 99);  // step 2. a pair of conditional branches with different direction // their PC differs in one bit if (d % 2 == 0) goto end; if (d % 2 == 1) goto end;  end:

经过测试，PC 的输入是 $PC[18:2]$，其余的没有。

逆向工程相连度和 index 函数的 PC 输入¶

接下来是比较复杂的一步，同时逆向工程表的相连度和 index 函数的 PC 输入。这是因为这两部分是紧密耦合的：只有知道相连度，才能知道预测出来的分支数对应几个 set；但不知道 index 函数，又无法控制分支被分配到几个 set 中。首先，为了避免 PHR 的干扰，还是只注入一个随机数到 $PHRT[99]$ 上（事实上，$PHRT[99]$ 不是随便选择的，而是需要在 index 函数中，但通过测试可以找到满足要求的位）。其次，构造一系列分支，它们的地址满足：第 i 条分支（i 从 0 开始）的分支地址是 $i2^k$，其中 $k$ 是接下来要遍历的参数。当 $k=3$ 时，分支会被放到 0x0, 0x8, 0x10, 0x18, 0x20 等地址，涉及的 PC 位数随着分支数的增加而增加。接下来，我们分类讨论：

假如涉及的 PC 位都在 tag 中，没有出现在 index 中：那么这些分支都会被映射到同一个 set 内，一旦分支数量超出相连度，就会出现预测错误。
假如涉及的 PC 位有一部分出现在 index 中：那么每有一个 PC 位出现在 index 中，这些分支可以被分配到的 set 数量就翻倍，直到这些 set 都满了以后，才会出现预测错误。
假如涉及的 PC 位有一部分超出 PC 输入的范围（如前面逆向工程得到的 $PC[18:2]$）：那么超出输入的部分地址会被忽略，使得 set 内出现冲突。

实验结果如下图：

纵坐标就是上面的 $k$，横坐标是测试的条件分支数，颜色表示预测的错误率。当颜色从深色变浅，就说明出现了预测错误。观察：

$PC[3]$ 的情况下，只能预测 4 个分支，而 $PC[4]$ 或 $PC[5]$ 可以预测 8 个分支，暗示了四路组相连，然后 $PC[4]$ 和 $PC[5]$ 对应到了两个 set，所以能够正确预测 8 个分支。
$PC[6]$ 的情况下，可以预测 16 个分支，对应 4 个 set；后续 $PC[7]$ 和 $PC[8]$ 又可以预测 8 个分支，对应 2 个 set；意味着 $PC[6]$ 在 index 中，给 $PC[4]$ 和 $PC[5]$ 提供了两倍的 set；$PC[9]$ 在 index 中，给 $PC[6]$、$PC[7]$ 和 $PC[8]$ 提供了两倍的 set。
后续更高的 PC 位，没有受到 index 函数的影响，因此都是 4，直到最后超出 PC 输入范围。

这就说明它是四路组相连，PC[6] 和 PC[9] 参与到了 index 函数中。

下面给读者一个小练习，下面是在 Qualcomm Oryon 上测得的结果，可以看到噪声比较大，你能推断出它是几路组相连，有哪些 PC 参与到了 index 计算吗？

揭晓答案

四路组相连，$PC[6]$ 和 $PC[7]$ 参与到了 index 函数。

那么，这种测试是怎么构造的呢？即需要用相同的 PHR 去预测 $PC=i2^k$ 的多条分支。思路比较复杂：

首先执行一条间接分支，目的地址是 $i2^{k-1}$，那么它对 PHRT 的贡献是 $\mathrm{PHRT}_1 = (\mathrm{PHRT}_0 \ll 1) \oplus (i2^{k-3})$；
接下来，在 $i2^{k-1}$ 的位置，再执行一条直接分支，目的地址是 $i2^k$，那么它对 PHRT 的贡献是 $\mathrm{PHRT}_2 = (\mathrm{PHRT}_1 \ll 1) \oplus (i2^{k-2}) = (((\mathrm{PHRT}_0 \ll 1) \oplus (i2^{k-3})) \ll 1) \oplus (i2^{k-2}) = \mathrm{PHRT}_0 \ll 2$。

可见经过两步以后，PHRT 是保持不变的。针对 PHRB，只要 $i2^{k-1}$ 没有涉及 $PC[5:2]$，就能保证相同。那么如果 $k$ 足够小，也有办法：

首先执行一条间接分支，目的地址是 $i2^{k-1}$；
接下来执行大量的 NOP，使得 $B$ 的低位等于 0，然后再执行一条间接分支，目的地址是 $i2^k$。

因此我们总是可以通过两次分支，实现用相同的 PHR 预测不同 PC 上的多条分支。

逆向工程 tag 函数¶

接下来，进行 tag 函数的逆向工程。为了逆向工程 tag 函数，我们希望找到两个位在 tag 函数中有异或关系，那么如果这两个位同时设为 0，或者同时设为 1，其异或结果都等于 0，使得计算出来的 tag 函数相同，如果此时 index 还相同，那么预测器就无法区分这两种情况。

为了利用这一点，生成两个 0 到 1 的随机数 $k$ 和 $l$，分别把它们注入到 PC、PHRB 或者 PHRT 中，去预测一个条件分支，其跳转与否取决于 $k$ 的值（论文中有个小 typo）。如果 $k$ 和 $l$ 在 tag 函数中有异或关系，那么预测器总会预测错误。

实验结果大致如下，横纵坐标表示注入哪一个位，颜色代表预测错误率，深色意味着预测错误，也就是找到了一组异或关系：

其中有一些异或关系，因为对应的位在 index 中出现的缘故，导致没有显现出来。根据已知的异或关系外推，可以得到如下的 tag 计算公式：

PC[7] xor PHRT[0,12,...,96] xor PHRB[8,21]
PC[8] xor PHRT[1,13,...,97] xor PHRB[9,22]
PC[9] xor PHRT[2,14,...,98] xor PHRB[10,23,24]
PC[10] xor PHRT[3,15,...,87,99] xor PHRB[11,12,25]
PC[11] xor PHRT[4,16,...,88] xor PHRB[0,13,26]
PC[12] xor PHRT[5,17,...,89] xor PHRB[1,14,27]
PC[13] xor PHRT[6,18,...,90] xor PHRB[2,15]
PC[14] xor PHRT[7,19,...,91] xor PHRB[3,16]
PC[15] xor PHRT[8,20,...,92] xor PHRB[4,17]
PC[16] xor PHRT[9,21,...,93] xor PHRB[5,18]
PC[17] xor PHRT[10,22,...,94] xor PHRB[6,19]
PC[18] xor PHRT[11,23,...,95] xor PHRB[7,20]
PC[2:5]: 单独出现，不和其他位异或

那么，这里是怎么实现针对 PC、PHRT 和 PHRB 的注入的呢？针对 PHRT 的注入前面已经提到过，只需要一个间接分支：

// target differs in T[2] auto targets[2] = {target0, target1}; goto targets[d % 2]; target0: // add nop target1:

PHRB 的注入就比较复杂了，例如要注入 $B[2]=k$，我们需要进行三次分支的跳转：

第一次跳转，$B$ 相同，$T[2]=k$；
第二次跳转，$B[2]=k$，$T[3]=k$；
第三次跳转，$B[2]=B[3]=k$, $T$ 相同。

经过计算，可以发现前两次跳转对 PHRT 的抵消，第二次跳转的 $B[2]$ 与第三次跳转的 $B[3]$ 抵消，最后相当于只有最后一次跳转的 $B[2]=k$ 对 PHRB 产生了贡献。

最复杂的是 PC 的注入，这次需要分情况讨论：

第一种情况是，要注入的 PC 位比较高，具体来说，是 $PC[7]$ 或者更高的位数，此时我们可以很容易地避免引入对 PHRB 的贡献，因为它只考虑 $B[5:2]$：

第一次跳转，$B$ 相同，$T[6]=k$；
第二次跳转，$B[6]=k$ 但对 PHRB 没有贡献，$T[7]=k$。

那么两次跳转完以后，PHRB 不变，PHRT 的贡献被抵消掉，同时实现了 $PC[7]=k$ 的注入。

第二种情况是，要注入的正好是 $PC[6]$，继续用上面的方法会发现 PHRB 无法抵消，这时候，需要引入第三次跳转：

第一次跳转，$B$ 相同，$T[3]=k$；
第二次跳转，$B[3]=k$，$T[5]=T[4]=k$；
第三次跳转，$B[4]=k$，$T[6]=k$。

验算可以发现，PHRB 和 PHRT 的贡献全都被抵消，成功注入 $PC[6]=k$。

最后来看要注入的 PC 更低的情况，例如要注入 $PC[3]=k$，还是用三次跳转：

第一次跳转，$B$ 相同，$T[2]=k$；
第二次跳转，$B[2]=k$，$T[2]=T[3]=k$；
第三次跳转，$B[3]=k$，$T[3]=k$。

这样就成功注入 $PC[3]=k$。

那么 PC 的注入就完成了，只有 $PC[2]$ 没有找到合适的方法来注入。

逆向工程 index 函数¶

相连度和 tag 函数已知，接下来，让我们逆向工程最后的 index 函数。逆向工程的思路如下：

通过前面的逆向工程，发现 $PC[5:2]$ 是独立出现在 tag 函数中，并且没有出现在 index 中，所以我们可以构造出多个分支，它们的 tag 不同；
进一步，构造两组条件分支，每组都有四个条件分支，因为是四路组相连，如果这两组分别映射到两个 set 中，就可以正确预测；反之，如果被映射到同一个 set 中，就会预测错误；
和之前类似，向 PC、PHRB 或 PHRT 注入两个随机数 $k$ 和 $l$，然后预测两组共八个条件分支，这些条件分支的跳转方向都是 k xor l；
如果 $k$ 和 $l$ 注入的位同时出现在 index 函数中，但是没有异或关系，那么这八个条件分支可以正确地被预测；
如果 $k$ 和 $l$ 注入的位同时出现在 index 函数中，并且有异或关系，那么这八个条件分支会被映射到同一个 set 内，最多只能正确预测其中四个分支；
如果 $k$ 和 $l$ 注入的位至少出现一个在 tag 函数中，那么一个分支会在同一个 set 内占用两项，导致最多只能正确预测其中两个分支。

注入 $PC[9]$ 和 $PHRT[i]$ 的实验结果如下：

结合上面的讨论，可以知道：

PC[9] 和 PHRT[38] 和 PHRT[88] 有异或关系
PHRT[2]、PHRT[12]、PHRT[17]、PHRT[22]、PHRT[27]、PHRT[33]、PHRT[43]、PHRT[53]、PHRT[58]、PHRT[63]、PHRT[68]、PHRT[73]、PHRT[78]、PHRT[83]、PHRT[93] 在 index 中，但是和 PC[9] 没有异或关系

实际上还有 PHRT[7] 和 PHRT[48] 也在 index 中，但实际上测试的时候为了保证历史最长的表提供预测，还额外注入了 PHRT[99]，它与 PHRT[7] 和 PHRT[48] 有异或关系，所以在上图没有显现出来。

用类似的方法，去测试 PHRT 与 PHRT、PHRT 与 PHRB、PHRB 与 PHRB 之间的异或关系，就可以找到最终的 index 函数：

PHRT[2] xor PHRT[43] xor PHRT[93]
PHRT[7] xor PHRT[48] xor PHRT[99]
PHRT[12] xor PHRT[63] xor PHRB[5]
PHRT[17] xor PHRT[68] xor PHRB[10]
PHRT[22] xor PHRT[73] xor PHRB[15]
PHRT[27] xor PHRT[78] xor PHRB[20]
PHRT[33] xor PHRT[83] xor PHRB[25]
PHRT[38] xor PHRT[88] xor PC[9]
PHRT[53] xor PHRT[58] xor PHRB[0]
PC[6]

至此使用分支历史最长的表的逆向就完成了。接下来讨论一下，如何逆向工程分支历史更短的表。

逆向工程使用分支历史更短的 TAGE 表¶

前面提到，TAGE 在预测时，会选取提供预测的多个表中使用历史最长的那个表。那么，如果要测试使用历史第二长的表，应该怎么办呢？我们尝试了以下方法：

在使用历史更长的表里，预先插入一些表项，再去测试历史较短的表的情况，由于 TAGE 会利用它的 useful 计数器来进行新表项的分配，当历史更长的表里的表项的 useful 不为零，可以防止它被覆盖成新的内容，逼迫 TAGE 用历史更短的表来进行预测；
把多个表当成一个整体来考虑，比如在测试能够正常预测的分支数量的时候，得到的是多个表叠加的结果，再减去已知数量，可以得到历史比较短的表的信息；
在超出要测试的表的历史部分，注入大量随机数，例如要测试的一个表，只用到了历史的低 57 位，那就在更高的部分注入大量的随机数，那么历史最长的表能够提供预测的概率会非常小，从而逼迫 TAGE 用当前被测试的表来做预测。

通过这些方法，我们成功地逆向出了 Firestorm 的剩下的 TAGE 表的信息：

有兴趣的读者可以试着自己复现一下，看看能不能得到对应的实验结果，然后从结果中分析出硬件的参数。有意思的是，我们逆向出来 Qualcomm Oryon 的分支预测器的大小（6 个表一共 80KB 的空间），与官方在 Hot Chips 上公开的是一致的。

总结¶

我们通过一系列方法，实现了对 Apple M1 Firestorm 的条件分支预测器的逆向，并且成功地把它应用到了设计基本一样的 Qualcomm Oryon 处理器上，为后续的研究提供了基础。

引用文献¶

本博客近三个月来的访问数据观察

Thu, 09 Oct 2025 00:00:00 +0000

本博客近三个月来的访问数据观察¶

写在前面¶

这个博客自 2014 年更新至今，已走过近十一个年头，累计发布了四百多篇文章。出于好奇，我一直想了解哪些内容更受读者欢迎。五年前，我曾配置过 Google Analytics，但使用体验并不理想，于是转而自行部署了 rybbit 实例来收集访问数据。如今三个月过去，是时候与大家分享一些有趣的发现。

P.S. 如果你对数据收集有所顾虑，可以屏蔽对应的 analytics 脚本。

数据总览与趋势¶

首先来看这三个月内的整体访问情况与趋势：

访问量比我的预期要高一些。虽然这些年写了不少内容，但并没有刻意宣传，主要依赖搜索引擎推荐和读者的订阅与转发。从时间趋势上可以看出两个明显特点：

工作日访问量显著高于周末，通常为周末的两到三倍；
开学后工作日的访问量比暑假期间又高出一倍，但周末依然低迷。

由此推测，学生读者占比较高。结合国庆假期访问量的下降来看，国内读者仍是主力。下面是各国家与地区的访问分布：

值得一提的是，也有不少海外读者访问，看来近年来有意识地撰写英文内容确实产生了效果。

以 UTC+8 时区为基准，从访问时间分布中可以看出大家的作息习惯：

尽管大家可能习惯熬夜，但深夜阅读博客的并不多，多数访问集中在工作时间。

接下来是基于 User Agent 的统计。首先是浏览器分布：

不出所料，Chrome 及基于 Chromium 内核的浏览器占据主流，Firefox 和 Safari 占比不高。我本人目前也在使用 Firefox，希望它能持续发展，避免 Chrome 一家独大。

操作系统分布如下：

Windows 占比最高，macOS 次之。考虑到博客内容主要涉及计算机技术，这也大致反映了相关从业人员的偏好。

以下是访问次数较多的几篇文章：

这个排名有些出乎我的意料。这几篇文章在撰写时并未特别考虑入门读者的理解难度或内容的丰富性。或许是因为它们涉及的领域资料较少，因此在相关关键词搜索中容易被找到。这一点在 Google Search Console 中也得到了印证：

相比之下，一些精心准备的文章阅读量并不高，可能是因为所在领域已有较多优质内容，新文章难以脱颖而出。这也说明文章质量与阅读量之间并非简单的正比关系。

我还注意到一些重度用户，他们不仅阅读多篇文章，且停留时间较长：

未知用户一：阅读了分支预测、乱序执行、CPU 微架构分析等文章，推测可能是刚开始研究 CPU 微架构的同行；
未知用户二：浏览了 wishbone、乱序执行相关内容，猜测是学习清华计算机组成原理课程的学生；
未知用户三：从华为相关文章进入，随后浏览了其他内容，可能是搜索华为内容进入博客后，被其他文章吸引的读者；
未知用户四：阅读了 ARM/Samsung/Intel 等处理器微架构分析文章，又一位同行，但有一定基础，更加关注业界。

类似的例子还有不少，在此不一一列举。尽管 CPU 相关文章的阅读量不及软件类内容，但能得到这么多同行的关注，确实令人欣慰——毕竟这本身就是一个小众领域，不能奢求过高的阅读量。

通过这次数据分析，我收获了不少有趣的观察。未来可能会不定期更新类似内容，看看随着时间推移，是否会有新的发现。

最后，如果你对访问数据的收集感到不适，可以直接屏蔽 analytics 脚本（或许你的浏览器插件已经这样做了）。根据 rybbit 的官方说明，其信息收集方法较为尊重用户隐私，我也没有对代码进行任何修改，不放心的读者可以阅读 rybbit 的源码来审计。

P.S. 你能看出这篇文章是，我先写了一遍，然后让大模型润色的结果吗？

ARM 公版核微架构演进

Wed, 10 Sep 2025 00:00:00 +0000

ARM 公版核微架构演进¶

背景¶

ARM 公版核微架构的演进频繁，型号又比较多，相关信息散落在各种地方，为了方便查阅，在这里做一个收集。

2025 年¶

C1-Ultra¶

Arm の新しい CPU「C1」は 2 桁パーセントの性能アップ。電力効率も大幅改善
Inside Arm's New C1‑Ultra CPU: Double‑Digit IPC Gains Again!
- C1-Ultra: successor to Cortex X925
- Branch prediction: Additional tracking for local/per-PC history
- 33% increase in L1 I-Cache available bandwidth
- Out of order window size growth: Up to 25% growth, Up to ~2K instruction in flight
- 2x L1 data cache capacity (128KB)
- Data prefetchers: array-indexing coverage
Arm® C1-Ultra Core Technical Reference Manual
- Implementation of the Scalable Vector Extension (SVE) with a 128-bit vector length and Scalable Vector Extension 2 (SVE2)
- Implementation of the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2 (SME2), and support for the C1-SME2 unit
- configure the L2 cache to be 2048KB or 3072KB
- A 64KB, 4-way set associative L1 instruction cache with 64-byte cache lines
- A fully associative L1 instruction Translation Lookaside Buﬀer (TLB) with native support for 4KB, 16KB, 64KB, and 2MB page sizes
- A 128KB, 4-way set associative cache with 64-byte cache lines
- A fully associative L1 data TLB with native support for 4KB, 16KB, and 64KB page sizes and 2MB and 512MB block sizes
- L2 cache is private to the core and can be configured to be 2MB 8-way set associative or 3MB 12-way set associative
- L1 instruction TLB, Fully associative, 128 entries
- L1 data TLB, Fully associative, 96 entries
- L1 Statistical Profiling Extension (SPE) TLB, Located in the SPE block, VA to PA translations of any page and block size, 1 entry
- L1 TRace Buﬀer Extension (TRBE) TLB, VA to PA translations of any page and block size, 1 entry
- L2 TLB, Shared by instructions and data, 8-way set associative, 2048 entries
- L1 instruction cache, 64KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT), Pseudo-Least Recently Used (LRU) cache replacement policy for L1, 32 bytes per cycle interface with L2
- L1 data cache, 128KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT), Re-Reference Interval Prediction (RRIP) replacement policy, 4×64-bit read paths and 4×64-bit write paths for the integer execute pipeline, 4×128-bit read paths and 4×128-bit write paths for the vector execute pipeline
- L2 cache, 2MB 8-way set associative with 4 banks or 3MB 12-way set associative with 4 banks, Physically Indexed, Physically Tagged (PIPT), Dynamic biased cache replacement policy, One CHI Issue E compliant interfaces with 256-bit read and write DAT channel widths

C1-Pro¶

Arm Lumex C1-Pro CPU Core: What You Need to Know
- C1-Pro: successor to Cortex-A725
- Larger direction predictor and branch history
- 2x capacity 0-cycle BTB
- 16x capacity 1-cycle BTB
- 50% more L1 Instruction TLB capacity
- Increase effective L1D cache bandwidth
- Lower latency L2 TLB hit
- New indirect prefetcher

2024 年¶

Cortex X925¶

Arm Unveils 2024 CPU Core Designs, Cortex X925, A725 and A520: Arm v9.2 Redefined For 3nm Archive
- Decode & Dispatch: 10-wide
- SIMD/FP execution: 6x 128b
- Integer ALU pipelines: 1- and 2-cycle operations
- Integer multiply execution: 4x versus Cortex-X4
- FP compare execution: 2x versus Cortex-X4
- >2x increase in SIMD/FP issue queues
- 2x increase in max instruction-window capacity（注：Cortex X4 是 384x2，推测 Cortex X925 是 768x2）
- Sign-extension instruction elimination
- Branch prediction: 2x instruction window size
- Instruction Fetch: 2x increase in L1 I$ available bandwidth, 2x increase in L1 iTLB size, Fold out unconditional direct branches
- 3 -> 4 load pipelines
- 2x increase in L1 D$ available bandwidth
- 25-40% in back-end OoO growth
Arm® Cortex-X925 Core Technical Reference Manual
- Implementation of the Scalable Vector Extension (SVE) with a 128-bit vector length and Scalable Vector Extension 2 (SVE2)
- configure the L2 cache to be 2048KB or 3072KB
- A 64KB, 4-way set associative L1 instruction cache with 64-byte cache lines
- A fully associative L1 instruction Translation Lookaside Buﬀer (TLB) with native support for 4KB, 16KB, 64KB, and 2MB page sizes
- A 64KB, 4-way set associative cache with 64-byte cache lines
- A fully associative L1 data TLB with native support for 4KB, 16KB and 64KB page sizes and 2MB and 512MB block sizes
- L2 cache is private to the core and can be configured to be 2MB 8-way set associative or 3MB 12-way set associative
- L1 instruction TLB, Caches entries at the 4KB, 16KB, 64KB, or 2MB granularity of Virtual Address (VA) to Physical Address (PA) mapping only, Fully associative, 128 entries
- L1 data TLB, Caches entries at the 4KB, 16KB, 64KB, 2MB, or 512MB granularity of VA to PA mappings only, Fully associative, 96 entries
- L2 TLB, Shared by instructions and data, VA to PA mappings for 4KB, 16KB, 64KB, 2MB, 32MB, 512MB, and 1GB block sizes, Intermediate Physical Address (IPA) to PA mappings for: 2MB and 1GB block sizes in a 4KB translation granule, 32MB block size in a 16KB translation granule, 512MB block size in a 64KB granule; Intermediate PAs (descriptor PAs) obtained during a translation table walk, 8-way set associative, 2048 entries
- L1 instruction cache, 64KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT)
- The Cortex®-X925 core supports the AArch64 prefetch memory instructions, PRFM PLI, into the L1 instruction cache or L2 cache
- L1 data cache, 64KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT), Re-Reference Interval Prediction (RRIP) replacement policy, 4×64-bit read paths and 4×64-bit write paths for the integer execute pipeline, 4×128-bit read paths and 4×128-bit write paths for the vector execute pipeline
- L2 cache, 2MB 8-way set associative with 4 banks or 3MB 12-way set associative with 4 banks, Physically Indexed, Physically Tagged (PIPT)
Arm® Cortex-X925 Core Software Optimization Guide

2023 年¶

Cortex X4¶

Arm Unveils 2023 Mobile CPU Core Designs: Cortex-X4, A720, and A520 - the Armv9.2 Family Archive
- Support for larger L2 (2M)
- Dispatch width: 10 instrs vs Cortex-X3 (6 instrs I$, 8 instrs Mop$)
- Overall pipeline depth (branch mispredict penalty): 10 cycles vs Cortex-X3 (11 cycles I$, 9 cycles Mop$)
- ALUs: 8 vs 6 (Cortex-X3)
- Branch units: 3 vs 2 (Cortex-X3)
- Integer MAC: 2 vs 1 (Cortex-X3)
- Pipelined FP divider / sqrt: Y vs N (Cortex-X3)
- MCA capacity: 320x2 -> 384x2
- 4th LS address generation: LS LS LD -> LS LD LD ST
- New L1 temporal data prefetcher
- Reduced L1 data bank conflicts
- Larger L1 data TLB: 48 -> 96
Arm® Cortex-X4 Core Technical Reference Manual

Cortex A720¶

Arm Unveils 2023 Mobile CPU Core Designs: Cortex-X4, A720, and A520 - the Armv9.2 Family Archive
- 11-cycle mispredict penalty, vs 12 (Cortex-A715)
- Improved 2-taken branch prediction
- Pipelined FDIV/FSQRT unit
- Faster transfers from Floating-Point/NEON/SVE2 to Integer
- Earlier deallocation of mops from Load-Store Issue Queues
- Lower latency for L2 cache hits, 9-cycle latency to access L2, vs 10 (Cortex-A715)
- Up to 2x memset(0) bandwidth in L2
- New L2 spatial-prefetch engine

2022 年¶

Cortex X3¶

Arm Unveils Next-Gen Flagship Core: Cortex-X3
- 50% larger L1 + L2 (new) BTB capacity
- 10x larger L0 BTB capacity
- New predictor dedicated for indirect branches
- Double return-stack capacity (32 entries)
- Mop cache 50% capacity (1.5K entries)
- Removed 1 pipeline stage in Mop Cache fetch, 10->9 cycles for a branch mispredict
- Increase decode bandwidth: 5->6
- Integer ALUs increase 4->6: 2->4 single-cycle (SX), 2 single-/multi-cycle (MX)
- ROB/MCQ: 288x2 -> 320x2
- Integer load bandwdith: 24B -> 32B
- Additional data prefetch engines: Spatial, Pointer/Indirect
Arm® Cortex‑X3 Core Technical Reference Manual

Neoverse V2¶

Arm Neoverse V2 platform: Leadership Performance and Power Efficiency for Next-Generation Cloud Computing, ML and HPC Workloads
- 6-wide/8-wide front-end
- 64KB ICache
- 320+ OoO window
- 8-wide dispatch
- 8-wide retire
- 2 LS + 1 LD / cycle
- 64KB DCache
- 6-ALU + 2-branch
- Quad 128-bit low latency SIMD datapath
- L2 10-cycle load-to-use, 128B/cycle, private L2 cache 1 or 2 MB
- Two predicted branches per cycle
- Predictor acts as ICache prefetcher
- 64kB, 4-way set-associative L1 instruction cache
- Two-level Branch Target Buffer
- 8 table TAGE direction predictor with staged output
- 10x larger nanoBTB
- Split main BTB into two levels with 50% more entries
- TAGE: 2x larger tables with 2-way associativity, Longer history
- Indirect branches: Dedicated predictor
- Fetch bandwidth: Doubled instruction TLB and cache BW
- Fetch Queue: Doubled from 16 to 32 entries
- Fill Buffer: Increased size from 12 to 16 entries
- Decode bandwidth: Increased decoder lanes from 5 to 6, Increased Decode Queue from 16 to 24 entries
- Rename checkpoints: Increased from 5 to 6 total checkpoints, Increased from 3 to 5 vector checkpoints
- Late read of physical register file – no data in IQs
- Result caches with lazy writeback
- Added two more single-cycle ALUs
- Larger Issue Queues, SX/MX: Increased from 20 to 22 entries, VX: Increased from 20 to 28 entries
- Predicate operations: Doubled predicate bandwidth
- Zero latency MOV; Subset of register-register and immediate move operations execute with zero latency
- Instruction fusion: More fusion cases, including CMP + CSEL/CSET
- Two load/store pipes + one load pipe
- 4 x 8B result busses (integer)
- 3 x 16B result busses (FP, SVE, Neon)
- ST to LD forwarding at L1 hit latency
- RST and MB to reduce tag and data accesses
- Fully-associative L1 DTLB with multiple page sizes
- 64kB 4-way set associative Dcache
- TLB Increased from 40 to 48 entries
- Replacement policy Changed from PLRU to dynamic RRIP
- Larger Queues: Store Buffer, ReadAfterRead, ReadAfterWrite
- Efficiency: VA hash based store to load forwarding
- Multiple prefetching engines training on L1 and L2 accesses: Spatial Memory Streaming, Best Offset, Stride, Correlated Miss Cache, Page
- New PF engines: Global SMS – larger offsets than SMS, Sampling Indirect Prefetch – pointer dereference, TableWalk – Page Table Entrie
- Private unified Level 2 cache, 8-way SA, 4 independent banks
- 64B read or write per 2 cycles per bank = 128B/cycle total
- 96-entry Transaction Queue
- Inclusive with L1 caches for efficient data and instruction coherency
- AMBA CHI interface with 256b DAT channels
- Capacity 2MB/8-way with latency of 1MB (10-cycle ld-to-use)
- Replacement policy 6-state RRIP (up from 4)
Hot Chips 2023: Arm’s Neoverse V2
Arm® Neoverse™ V2 Core Technical Reference Manual
Arm Neoverse V2 Software Optimization Guide

2021 年¶

Cortex X2¶

2020 年¶

Neoverse N2¶

Arm Neoverse N2: Arm’s 2nd generation high performance infrastructure CPUs and system IPs
- Branch Prediction, 2x 8 instrs (up to 2 taken per cycle), 2x improvement
- Nano BTB (0 cyc taken-branch bubble), 64 entry, 4x improvement
- Conditional branch direction state, 1.5x improvement
- Main BTB, 8K entry, 1.33x improvement
- Alt-Path Branch Prediction
- 64KB Instruction cache
- 1.5K entry Mop Cache
- 16-entry Fetch Queue, 1.33x improvement
- Fetch Width: 4 instr from i$, 5 instr from MOP$, Up to 1.5x improvement
- Early branch redirect: uncond + cond
- Decode width: 4 (I-cache) or 5 (Mop cache), Up to 1.25x improvement
- Branch predict up to 16-inst/cycle, 2-taken/cycle
- New Macro-op (MOP) cache with 1.5k entries
- 50% larger branch direction predicton
- 33% larger BTB with shorter average latency
- Early re-steering for conditional branches that miss the BTB
- Rename width: 5 instrs, 1.2x improvement
- Rename Checkpointing: Yes
- ROB size: 160+, 1.25x improvement
- ALUs: 4, 1.33x improvement
- Branch resolution: 2 per cycle, 2x improvement
- Overall Pipeline Depth: 10 cycles, 1.1x improvement
- 64KB L1 Data cache
- Private 512KB/1MB L2 Cache
- AGU: 2-LD/ST + 1 LD, 1.5x improvement
- L1 LD Hit bandwidth: 3x 16B/cycle, 1.5x improvement
- Store data B/W: 32B/cycle, 2x improvement
- L2 bandwidth: 64B read + 64B write, 2x improvement
- L2 transactions: 64, 1.3x improvement
- Data Prefetch Engines: Stride, spatial/region, stream, temporal
- Correlated Miss Caching (CMC) prefetching

Neoverse V1¶

SW defined cars: HPC, from the cloud to the dashboard for an amazing driver experience
- Faster run-ahead for prefetching into the I$ (2x32B bandwidth)
- 33% larger BTBs (8K entry)
- 6x nano BTB (96 entry), zero-cycle bubble
- 2x number of concurrent code regions tracked in front-end
- Introduction of Mop Cache, L0 decoded instruction cache (3K entry)
- high dispatch bandwidth, 8-instrs per cycle, 2x increase, I$ decode bandwidth increased from 4x to 5x
- Lower latency decode pipeline by 1 stage
- OoO window size, 2x+ ROB (256 entry + compression)
- Increase superscalar integer execution bandwidth, 1->2 Branch Execution, 3->4 ALU
- 2x vector/fp bandwidth, 2x256b – SVE (new), 4x128b – Neon/FP
- 3rd LD AGU/pipe (50% incr), LS LS LD
- LD/ST data bandwidth, LD: 2x16B -> 3x16B, LD (SVE): 2x32B, ST: 16B -> 32B (2x), broken out into separate issue pipes
- Number of outstanding external memory transactions (48->96)
- MMU capacity 1.2K->2K entry (67% incr)
- L2 latency reduced by 1 cycle for 1M (now 10cyc load to use)
- 11+ stage accordion pipeline
- 8-wide front-end / 15-wide issue
- Four 64-bit integer ALUs + two dedicated Branch units
- 2x 256-bit SVE datapaths
- 4x 128-bit Neon/FP datapaths
- 3x load / store addr
- 3x load data & 2x store data pipeline
- 8-wide Instruction fetch
- 5-8 wide decode / rename
- pipeline: P1 P2 F1 F2 DE1 RR RD I0 I1 I2 ...

Cortex X1¶

Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence Archive
- 50% larger L0-BTB capacity, 96 entries, zero-cycle bubble taken-branch latency
- Increased fetch bandwidth available, 5 instruction fetch from the instruction cache, 8 Mop fetch from the Mop cache
- 2x Mop cache capacity over Cortex-A77, 3K entries
- 33% increase in dispatch bandwidth, up to 8-instr/cycle
- 40% increase in out-of-order window size, 224 entry instruction window
- 2x FP/ASIMD execution bandwidth, 4x128b total bandwidth
- Doubling available L1-D, L2 bandwidth
- Doubleing of maximum L2 capacity
- Up to 33% increase in window growth for in-flight loads and stores
- 66% larger L2-TLB capacity, 2K entries
Arm Cortex-X1: The First From The Cortex-X Custom Program
Arm® Cortex®‑X1 Core Technical Reference Manual

Cortex A78¶

Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence Archive
- Expand prediction support to 2 taken branches per cycles
- Additional IMUL bandwidth, up to 2x per cycle
- 50% increase in load bandwidth over Cortex-A77, additional load AGU / result
- Double store-data bandwidth, 32B per cycle
- Double L2 interface bandwidth

2019 年¶

Neoverse N1¶

The Arm Neoverse N1 Platform: Building Blocks for the Next-Gen Cloud-to-Edge Infrastructure SoC
- 4-wide front-end
- dispatching/committing up to 8 instructions per cycle
- three ALUs, a branch execution unit, two Advanced SIMD units, and two load/store execution units
- minimum misprediction penalty is 11-cycle
- fetch up to 4 instructions per cycle
- large 6K-entry main branch target buffer with 3-cycle access latency
- 64-entry micro-BTB and a 16-entry nano-BTB
- 12-entry fetch queue
- fully associative 48-entry instruction TLB
- 4-way set-associative 64KB I-cache
- I-cache can deliver up to 16B of instructions per cycle
- up to 8 outstanding I-cache refill requests
- 4-wide decoder
- renaming unit can receive up to 4 macro-ops per cycle
- up to 8 micro-operations can be dispatched into the out-of-order engine each cycle
- The commit queue can track up to 128 micro operations
- up to 8 micro-ops can be committed per cycle
- a distributed issue queue with more than 100 micro-operations
- 4 integer execution pipelines, 2 load/store pipelines, and 2 Advanced SIMD pipelines
- 64kB 4-way set associative L1 data cache, 4-cycle load to use latency and a bandwidth of 32 bytes/cycle
- The core-private 8-way set associative L2 cache is up to 1MB in size and has a load-to-use latency of 11 cycles
- can also be configured with smaller L2 cache sizes of 256kB and 512kB with a load-to-use latency of 9 cycles
- L2 cache connects to the system via an AMBA 5 CHI interface with 16-byte data channels
- L3 cluster cache can be up to 2MB, with a load-to-use latency ranging between 28 and 33 cycles
- up to 256MB of shared system-level cache

AMD Zen 3 的 BTB 结构分析

Tue, 08 Jul 2025 00:00:01 +0000

AMD Zen 3 的 BTB 结构分析¶

背景¶

在之前，我们分析了 AMD Zen 1 和 AMD Zen 2 的 BTB，接下来分析它的再下一代微架构：2020 年发布的 AMD Zen 3 的 BTB，看看 AMD 的 Zen 系列的 BTB 是如何演进的。

官方信息¶

AMD 在 Software Optimization Guide for AMD EPYC™ 7003 Processors (Publication No. 56665) 中有如下的表述：

The branch target buffer (BTB) is a two-level structure accessed using the fetch address of the previous fetch block.

Zen 3 的 BTB 有两级，相比 Zen 1 和 Zen 2 少了一级。BTB 是用之前 fetch block 的地址去查询，而不再是当前 fetch block 的地址。用当前 fetch block 的地址查询 BTB 很好理解，要寻找某个地址开始的第一个分支，就用这个地址去查询 BTB，Zen 1 和 Zen 2 都是如此；用之前 fetch block 的地址，则是用更早的信息，去获取当前 fetch block 的信息，例如：

entrypoint1:  jmp entrypoint2  entrypoint2:  # what's the first branch after entrypoint2?

在查询从 entrypoint2 开始的第一条分支指令的时候，如果使用当前 fetch block，就是用 entrypoint2 的地址去查询，那就必须等到前面 jmp entrypoint2 指令的目的地址被计算得出；如果使用之前 fetch block，就是用 entrypoint1 的地址去查询，不用等到 jmp entrypoint2 指令的目的地址被计算得出。因此，如果用之前 fetch block，可以更早地进行 BTB 的访问，从而减少 BTB 的延迟，或者在相同延迟下获得更大的容量。但是，代价是：

从 entrypoint1 跳转到的 fetch block 可能有多个，例如最后一条是间接分支指令，那就需要找到正确的分支的信息
可能会从不同的地址跳转到 entrypoint2 这个 fetch block，因此它的信息可能会保存多份

Each BTB entry can hold up to two branches if the last bytes of the branches reside in the same 64-byte aligned cache line and the first branch is a conditional branch.

Zen 3 的 BTB entry 有一定的压缩能力，一个 entry 最多保存两条分支，前提是两条分支在同一个 64B 缓存行中，并且第一条分支是条件分支。这样，如果第二条分支是无条件分支，分支预测的时候，可以根据第一条分支的方向预测的结果，决定要用哪条分支的目的地址作为下一个 fetch block 的地址。虽然有压缩能力，但是没有提到单个周期预测两条分支，所以只是扩大了等效 BTB 容量。和 Zen 1、Zen 2 一样。

L1BTB has 1024 entries and predicts with zero bubbles for conditional and unconditional direct branches, and one cycle for calls, returns and indirect branches.

Zen 3 的第一级 BTB 可以保存 1024 个 entry，但不确定这个 entry 是否可以保存两条分支，也不确定这个 entry 数量代表了实际的 entry 数量还是分支数量，后续会做实验证实；针对条件和无条件直接分支的预测不产生气泡，意味着它的延迟是一个周期。相比 Zen 2 容量翻倍，且延迟降低一个周期，猜测和使用 previous fetch block 有关。

L2BTB has 6656 entries and creates three bubbles if its prediction differs from L1BTB.

Zen 3 的第二级 BTB 可以保存 6656 个 entry，但不确定这个 entry 是否可以保存两条分支，也不确定这个 entry 数量代表了实际的 entry 数量还是分支数量，后续会做实验证实；预测会产生三个气泡，意味着它的延迟是四个周期。

简单整理一下官方信息，大概有两级 BTB：

1024-entry L1 BTB, 1 cycle latency
6656-entry L2 BTB, 4 cycle latency

相比 Zen 1 和 Zen 2 有比较大的不同：去掉了原来很小的 L0 BTB，扩大了 L1 BTB，同时延迟缩短了一个周期；虽然 L2 BTB 有所缩小，但是延迟也变短了一个周期。

下面结合微架构测试，进一步研究它的内部结构。

微架构测试¶

在之前的博客里，我们已经测试了各种处理器的 BTB，在这里也是一样的：按照一定的 stride 分布无条件直接分支，构成一个链条，然后测量 CPI。

考虑到 Zen 3 的 BTB 可能出现一个 entry 保存两条分支的情况，并且还对分支的类型有要求，因此下面的测试都会进行四组，分别对应四种分支模式：

uncond：所有分支都是无条件分支：uncond, uncond, uncond, uncond, ...
cond：所有分支都是条件分支：cond, cond, cond, cond, ...
mix (uncond + cond)：条件分支和无条件分支轮流出现，但 uncond 在先：uncond, cond, uncond, cond, ...
mix (cond + uncond)：条件分支和无条件分支轮流出现，但 cond 在先：cond, uncond, cond, uncond, ...

虽然 Zen 3 使用 previous fetch block 来访问 BTB，但在这几种分支模式中，使用 previous fetch block 还是访问 current fetch block，结果都是唯一的，所以并不会对结果带来影响。

stride=4B¶

首先是 stride=4B 的情况：

可以看到，图像上出现了三个比较显著的拐点：

第一个拐点是 4 条分支，CPI=1，对应 L1 BTB，没有达到完整容量，可能是因为分支太过密集
第二个拐点是 2048 条分支，CPI=3.6；第三个拐点是 4096 条分支，CPI=4/4.2/4.4

Zen 3 在 stride=4B 的情况下 L1 BTB 表现比较一般，应该是牺牲了高密度分支下的性能；而主要命中的是 L2 BTB，在不同的分支模式下，测出来差不多的结果。为了验证这一点，统计了如下的性能计数器（来源：Processor Programming Reference (PPR) for AMD Family 19h Model 21h, Revision B0 Processors）：

PMCx08B [L2 Branch Prediction Overrides Existing Prediction (speculative)] (Core::X86::Pmc::Core::BpL2BTBCorrect)

它代表了 L2 BTB 提供预测（准确地说，L2 BTB 提供了预测且和 L1 BTB 提供的预测结果不同，覆盖了 L1 BTB 的预测结果）的次数，当分支数不大于 4 的时候，这个计数器的值约等于零；此后快速上升，说明后续都是 L2 BTB 在提供预测。

更进一步观察，发现 2048 到 4096 的 CPI 上升，来自于 L1 BTB 完全失效：2048 条分支时，L1 BTB 还能提供约 10% 的预测，所以 CPI=0.1*1+0.9*4=3.7，但到 4096 条分支的时候，完全由 L2 BTB 提供分支，此时 CPI=4。

超过 4096 以后，则 L2 BTB 也开始缺失，出现了译码时才能发现的分支，如果这是一条 uncond 分支，那么会在译码时回滚，这一点可以通过如下性能计数器的提升来证明（来源：Processor Programming Reference (PPR) for AMD Family 19h Model 21h, Revision B0 Processors）：

PMCx091 [Decode Redirects] (Core::X86::Pmc::Core::BpDeReDirect): The number of times the instruction decoder overrides the predicted target.

但在 L2 BTB 缺失后，如果译码器发现了 cond 分支，会把它预测为不跳转，所以要等到执行才能发现分支预测错误。这就导致了 cond 模式下 L2 BTB 溢出时 CPI=16，而 uncond 模式下 L2 BTB 溢出时 CPI=12，提前在译码阶段发现了 uncond 分支并纠正。

但译码器的纠正能力不是万能的：假如它首先发现了一条 cond 分支，在它其后又发现了一条 uncond 分支，它会用 uncond 分支去纠正，但实际上前面的 cond 分支会跳转，所以此时译码器纠正也无法提升性能，即使 BpDeReDirect 计数器的值看起来很大。

stride=8B¶

接下来观察 stride=8B 的情况：

第一个台阶在所有分支模式下都是 1024 个分支，CPI=1，对应 1024-entry 的 L1 BTB
第二个台阶不太明显，但是在 4096 附近在所有分支模式下都是一个拐点，CPI=4，对应 L2 BTB；在 mix (uncond + cond) 模式下，超过 4096 分支后 CPI 缓慢上升，到 6144 条分支 CPI=4.25，到 6656 条分支 CPI=4.85，之后 CPI 快速上升；在 mix (cond + uncond) 模式下，到 5888 条分支 CPI=5。

L2 BTB 的容量不太确定，超过 4096 后需要一个 entry 保存两条分支才能获得更多容量，但也带来了一定的额外的延迟。与此同时 4096 也对应了 32KB ICache 的容量，这会对分析带来干扰。

从 BpDeReDirect 计数器来看，uncond 分支模式下，当分支数量超过 4096 后，L2 BTB 从 4096 时无缺失，之后缺失快速提升，说明此时 L2 BTB 容量确实是 4096。在 mix (cond + uncond) 模式下，分支数超过 4096 时，BpDeReDirect 计数器略微上升，直到 6144 条分支后才有明显的上升。

stride=16B¶

继续观察 stride=16B 的情况：

相比 stride=8B，L1 BTB 的行为没有变化。4096 对应的 CPI 有所下降，从 BpL2BTBCorrect 性能计数器可以发现是 L1 BTB 起了一定的作用。在 mix (cond + uncond) 模式下，直到 5632 条分支还维持了 CPI=3.25，之后 CPI 缓慢上升，到 6656 条分支时 CPI=3.75，到 6912 条分支时 CPI=4。

CPI=3.25 可能是来自于 1 和 4 的加权平均：25% 的时候是 1 周期，75% 的时候是 4 周期，平均下来就是 1*0.25+4*0.75=3.25。这意味着 L1 BTB 还要保持 25% 的命中率。观察 BpL2BTBCorrect 性能计数器，发现它的取值等于 75% 的分支执行次数，意味着 L1 BTB 确实提供了 25% 的预测，L2 BTB 提供了剩下 75% 的预测。这一点是挺有意思的，意味着 L1 BTB 可能采用了一些对这种循环访问模式友好的替换策略：朴素的 LRU（或类 LRU）替换策略会导致 L1 BTB 出现 100% 缺失。

stride=32B¶

继续观察 stride=32B 的情况：

相比 stride=16B，L1 BTB 的行为没有变化，但是出现了一些性能波动。所有分支模式下，L2 BTB 的拐点都出现在 5120，但性能波动比较大，mix (cond + uncond) 模式下的 CPI 达到了 4.6。通过 BpDeReDirect 性能计数器的变化，可以确认这个拐点确实是来自于 L2 BTB 的缺失。

前面提到，译码器的纠正能力可能会给出错误的答案，在 stride=32B 时，就会出现一个很有意思的现象：

超出 L2 BTB 容量后，mix (uncond + cond) 模式下 BpDeReDirect 占分支数量的 50%
超出 L2 BTB 容量后，mix (cond + uncond) 模式下 BpDeReDirect 占分支数量的接近 100%

解释起来也并不复杂：stride=32B 的情况下，一个 64B cacheline 只有两条分支，那么：

mix (uncond + cond) 模式下，第一条分支是 uncond，译码器会发现并 redirect；第二条分支是 cond，译码器会无视它，不进行 redirect；所以最后是 50% 的 redirect 比例
mix (cond + uncond) 模式下，第一条分支是 cond，译码器会看到后面的 uncond 分支并 redirect；第二条分支是 uncond，译码器会发现并 redirect；所以最后是接近 100% 的 redirect 比例

顺带一提，uncond 模式下的 BpDeReDirect 占分支数量的接近 100%，cond 模式下的 BpDeReDirect 占分支数量的 0%，都是符合预期的。

stride=64B¶

继续观察 stride=64B 的情况：

相比 stride=32B，L1 BTB 的容量减半，达到了 512。之后出现了比较明显的性能波动，但四种分支模式下，拐点依然都是出现在 5120 条分支的位置。通过 BpDeReDirect 性能计数器的变化，可以确认这个拐点确实是来自于 L2 BTB 的缺失。由于 uncond 模式下，BTB sharing 不会工作，意味着 L2 BTB 至少有 5120 个 entry。

stride=128B¶

继续观察 stride=128B 的情况：

相比 stride=64B，L1 BTB 的容量进一步减小，达到了 256；L2 BTB 的性能依然波动剧烈，但四种分支模式下，拐点依然都是出现在 5120 条分支的位置。

考虑到 5120 这个拐点频繁出现，认为 L2 BTB 在不考虑 BTB entry sharing 的情况下，实际容量应该是 5120。那么剩下的 1536 个分支就是来自于压缩。

小结¶

测试到这里就差不多了，更大的 stride 得到的也是类似的结果，总结一下前面的发现：

L1 BTB 是 1024-entry，1 cycle latency，容量随着 stride 变化，大概率是 PC[n:5] 这一段被用于 index，使得 stride=64B 开始容量不断减半
L2 BTB 是 5120-entry，4 cycle latency；其中有 1536 个 entry 最多保存两条分支，前提是这两条分支在同一个 cacheline 当中，并且第一条是 cond，第二条是 uncond

Zen 1 到 Zen 3 的 BTB 的对比¶

下面是对比表格：

uArch	AMD Zen 1	AMD Zen 2	AMD Zen 3
L0 BTB size	4+4 branches	8+8 branches	N/A
L0 BTB latency	1 cycle	1 cycle	N/A
L1 BTB size	256 branches	512 branches	1024 branches
L1 BTB latency	2 cycles	2 cycles	1 cycle
L2 BTB size w/o sharing	2K branches	4K branches	5K branches
L2 BTB size w/ sharing	4K branches	7K branches	6.5K branches
L2 BTB latency	5 cycles	5 cycles	4 cycles
Technology Node	14nm	7nm	7nm
Release Year	2017	2019	2020

Zen 3 在 Zen 2 的基础上，没有更换制程，而是通过 previous fetch block 的方式，减少 L1 BTB 的延迟到 1 cycle，顺带去掉了 L0 BTB。L2 BTB 的大小进行了调整，减少了共享的部分，而增加了不限制分支类型的 BTB entry 数量，同时减少了一个周期的延迟，不确定这个延迟是单纯通过优化容量实现的，还是说也依赖了 previous fetch block 的方法来减少周期，更倾向于是后者，因为 L1 和 L2 BTB 都减少了一个周期的延迟。

如果按照 Intel 的 tick-tock 说法，那么 Zen 2 相比 Zen 1 是一次 tick，更换制程，微架构上做少量改动；Zen 3 相比 Zen 2 是一次 tock，不更换制程，但是在微架构上做较多改动。Zen 4 是 2022 年发布的，使用的是 5nm 制程；Zen 5 是 2024 年发布的，使用的是 4nm 制程。总结一下规律，AMD 会花费两年的时间来升级制程，并且实际上，Zen 4 和 Zen 5 不仅更新了制程，还在前端微架构上有较大的改动。

AMD Zen 3 和 ARM Neoverse V1 的 BTB 的对比¶

AMD Zen 3 和 ARM Neoverse V1 都是在 2020 发布的处理器，下面对它们进行一个对比：

uArch	AMD Zen 3	ARM Neoverse V1
L1/Nano BTB size	1024 branches	48*2 branches
L1/Nano BTB latency	1 cycle	1 cycle
L1/Nano BTB throughput	1 branch	1-2 branches
L2/Main BTB size w/o sharing	5K branches	4K*2 branches
L2/Main BTB size w/ sharing	6.5K branches	4K*2 branches
L2/Main BTB latency	4 cycles	2 cycles
L2/Main BTB throughput	1 branch	1-2 branches
Technology Node	7nm	5nm

虽然 AMD Zen 3 通过 previous fetch block 优化，实现了 1 cycle 下更大的 L1 BTB，但这一点在 2022 年发布的 ARM Neoverse V2 上被追赶：ARM Neoverse V2 的 L1/Nano BTB 也做到了 1024 的容量。

在 L2 BTB 方面，ARM Neoverse V1 占据了领先，无论是延迟还是容量；当然了，ARM Neoverse V1 的制程也要更加领先，ARM 采用的 5nm 对比 AMD 采用的 7nm。

更进一步，ARM Neoverse V1 实现了一个周期预测两条分支，即 two taken（ARM 的说法是 two predicted branches per cycle），在 2 cycle 的 Main BTB 上可以实现接近 AMD Zen 3 的 L1 BTB 的预测吞吐。AMD 也不甘示弱，在 2022 年发布的 AMD Zen 4 处理器上，实现了 two taken。

AMD Zen 2 的 BTB 结构分析

Tue, 08 Jul 2025 00:00:00 +0000

AMD Zen 2 的 BTB 结构分析¶

背景¶

在之前，我们分析了 AMD Zen 1 的 BTB，接下来分析它的下一代微架构：2019 年发布的 AMD Zen 2 的 BTB，看看 AMD 的 Zen 系列的 BTB 是如何演进的。

官方信息¶

AMD 在 Software Optimization Guide for AMD EPYC™ 7002 Processors (Publication No. 56305) 中有如下的表述：

The branch target buffer (BTB) is a three-level structure accessed using the fetch address of the current fetch block.

Zen 2 的 BTB 有三级，是用当前 fetch block 的地址去查询，和 Zen 1 一样。

Each BTB entry includes information for branches and their targets. Each BTB entry can hold up to two branches if the branches reside in the same 64-byte aligned cache line and the first branch is a conditional branch.

Zen 2 的 BTB entry 有一定的压缩能力，一个 entry 最多保存两条分支，前提是两条分支在同一个 64B 缓存行中，并且第一条分支是条件分支。这样，如果第二条分支是无条件分支，分支预测的时候，可以根据第一条分支的方向预测的结果，决定要用哪条分支的目的地址作为下一个 fetch block 的地址。虽然有压缩能力，但是没有提到单个周期预测两条分支，所以只是扩大了等效 BTB 容量。和 Zen 1 一样。

L0BTB holds 8 forward taken branches and 8 backward taken branches, and predicts with zero bubbles

Zen 2 的第一级 BTB 可以保存 8 条前向分支和 8 条后向分支，预测不会带来流水线气泡，也就是说每个周期都可以预测一次。相比 Zen 1 容量翻倍。

L1BTB has 512 entries and creates one bubble if prediction differs from L0BTB

Zen 2 的第二级 BTB 可以保存 512 个 entry，但不确定这个 entry 是否可以保存两条分支，也不确定这个 entry 数量代表了实际的 entry 数量还是分支数量，后续会做实验证实；预测会产生单个气泡，意味着它的延迟是两个周期。相比 Zen 1 容量翻倍。

L2BTB has 7168 entries and creates four bubbles if its prediction differs from L1BTB.

Zen 2 的第三级 BTB 可以保存 7168 个 entry，但不确定这个 entry 是否可以保存两条分支，也不确定这个 entry 数量代表了实际的 entry 数量还是分支数量，后续会做实验证实；预测会产生四个气泡，意味着它的延迟是五个周期。

简单整理一下官方信息，大概有三级 BTB：

(8+8)-entry L0 BTB, 1 cycle latency
512-entry L1 BTB, 2 cycle latency
7168-entry L2 BTB, 5 cycle latency

从表述来看，除了容量以外基本和 Zen 1 一致，猜测是在 Zen 1 的基础上扩大了容量。

下面结合微架构测试，进一步研究它的内部结构。

微架构测试¶

在之前的博客里，我们已经测试了各种处理器的 BTB，在这里也是一样的：按照一定的 stride 分布无条件直接分支，构成一个链条，然后测量 CPI。

考虑到 Zen 2 的 BTB 可能出现一个 entry 保存两条分支的情况，并且还对分支的类型有要求，因此下面的测试都会进行四组，分别对应四种分支模式：

uncond：所有分支都是无条件分支：uncond, uncond, uncond, uncond, ...
cond：所有分支都是条件分支：cond, cond, cond, cond, ...
mix (uncond + cond)：条件分支和无条件分支轮流出现，但 uncond 在先：uncond, cond, uncond, cond, ...
mix (cond + uncond)：条件分支和无条件分支轮流出现，但 cond 在先：cond, uncond, cond, uncond, ...

stride=4B¶

首先是 stride=4B 的情况：

可以看到，图像上出现了三个比较显著的台阶：

所有分支模式下，第一个台阶都是到 8 条分支，CPI=1，8 对应了 8-entry 的 L0 BTB
所有分支模式下，第二个台阶都是到 256 条分支，CPI=2，对应了 512-entry 的 L1 BTB，只体现出了一半的容量；但在 mix (uncond + cond) 和 mix (cond + uncond) 模式下，分支从 256 到 512 时 CPI 缓慢上升，意味着 L1 BTB 的 512-entry 还是可以完整访问，只是带来了一定的开销：CPI 从 2 增加到了 2.5
在 uncond 和 cond 模式下，第三个台阶到 4096 条分支，CPI=5，对应 L2 BTB，没有显现出完整的 7168 的大小
在 mix (uncond + cond) 模式下，第三个台阶延伸到了 5120，超出了 4096，依然没有显现出完整的 7168 的大小
在 mix (cond + uncond) 模式下，第三个台阶延伸到了 7168，显现出完整的 7168 的大小

和 Zen 1 不同，Zen 2 的 L1 BTB 出现了不同模式下容量不同的情况，原因未知，后续还会看到类似的情况。

Zen 2 的 L2 BTB 依然是带有压缩的，只有在 mix (cond + uncond) 模式下才可以尽可能地用上所有的容量，而其余的三种模式都有容量上的损失。

stride=8B¶

接下来观察 stride=8B 的情况：

现象和 stride=4B 基本相同，L1 BTB 从 256 到 512 部分的变化斜率有所不同，其余部分一致。

stride=16B¶

继续观察 stride=16B 的情况：

相比 stride=4B/8B，L0 BTB 和 L2 BTB 的行为没有变化；除了 cond 模式以外，L1 BTB 的容量减半到了 128，意味着 L1 BTB 采用了组相连，此时有一半的 set 不能被用上。此外，比较特别的是，从 stride=16B 开始，CPI=5 的平台出现了波动，uncond 模式下 CPI 从 5 变到 4 再变到了 5，猜测此时 L1 BTB 也有一定的比例会介入。

stride=32B¶

继续观察 stride=32B 的情况：

相比 stride=16B，L0 BTB 的行为没有变化；除了 cond 模式以外，L1 BTB 的容量进一步减到了 64，符合组相连的预期；L2 BTB 在 mix (uncond + cond) 模式下不再能体现出 5120 的容量，而是 4096：此时在一个 64B cacheline 中只有两条分支，第一条分支是 uncond，第二条分支是 cond，不满足 entry 共享的条件（必须 cond + uncond，不能是 uncond + cond），此时 uncond 和 cond 分别保存在两个 entry 中，每个 entry 只保存一条分支，因此 L2 BTB 只能体现出 4096 的容量。而 mix (cond + uncond) 模式依然满足 entry 共享的条件，所以依然体现出 7168 的容量。特别地，在 mix (cond + uncond) 模式下出现了非常剧烈的 CPI 抖动，可能出现了一些预期之外的性能问题。

stride=64B¶

继续观察 stride=64B 的情况：

相比 stride=32B，L0 BTB 的行为没有变化；除了 cond 模式以外，L1 BTB 的容量进一步减到了 32，符合组相连的预期，但 cond 模式下依然保持了 512 的容量；L2 BTB 在 mix (cond + uncond) 模式下只能体现出 4096 的容量，此时每个 64B cacheline 都只有一条分支，不满足两条分支共享一个 entry 的条件。

stride=128B¶

继续观察 stride=128B 的情况：

相比 stride=64B，L0 BTB 的行为没有变化；除了 cond 模式以外，L1 BTB 的容量进一步减到了 16，符合组相连的预期，而 cond 模式下 L1 BTB 容量也减少到了 256；L2 BTB 的容量减半到了 2048，意味着 L2 BTB 也是组相连结构。

小结¶

测试到这里就差不多了，更大的 stride 得到的也是类似的结果，总结一下前面的发现：

L0 BTB 是 (8+8)-entry，1 cycle latency，不随着 stride 变化，全相连
L1 BTB 是 512-entry，2 cycle latency，容量随着 stride 变化，大概率是 PC[n:3] 这一段被用于 index，使得 stride=16B 开始容量不断减半；但 cond 模式下的行为和其余几种模式不同，直到 stride=128B 才开始容量减半
L2 BTB 是 4096-entry，5 cycle latency，容量随着 stride 变化，大概率是 PC[n:6] 这一段被用于 index，使得 stride=128B 开始容量不断减半；其中有 3072 个 entry 最多保存两条分支，前提是这两条分支在同一个 cacheline 当中，并且第一条是 cond，第二条是 uncond；因此最多保存 7168 条分支

Zen 1 和 Zen 2 的 BTB 的对比¶

下面是对比表格：

uArch	AMD Zen 1	AMD Zen 2
L0 BTB size	4+4 branches	8+8 branches
L0 BTB latency	1 cycle	1 cycle
L1 BTB size	256 branches	512 branches
L1 BTB latency	2 cycles	2 cycles
L2 BTB size w/o sharing	2K branches	4K branches
L2 BTB size w/ sharing	4K branches	7K branches
L2 BTB latency	5 cycles	5 cycles
Technology Node	14nm	7nm
Release Year	2017	2019

可见 Zen 2 在容量上做了一定的扩展，但机制上比较类似；特别地，可能是观察到 cond + uncond 的压缩能够生效的比例没有那么高，所以只允许其中一部分 entry 被压缩，例如 4 路组相连，只有前 3 个 way 是可以保存两条分支；剩下的一个 way 只能保存一条分支。

AMD Zen 2 和 ARM Neoverse N1 的 BTB 的对比¶

AMD Zen 2 和 ARM Neoverse N1 都是在 2019 发布的处理器，下面对它们进行一个对比：

uArch	AMD Zen 2	ARM Neoverse N1
L0/Nano BTB size	8+8 branches	16 branches
L0/Nano BTB latency	1 cycle	1 cycle
L1/Micro BTB size	512 branches	64 branches
L1/Micro BTB latency	2 cycles	2 cycles
L2/Main BTB size w/o sharing	4K branches	3K*2 branches
L2/Main BTB size w/ sharing	7K branches	3K*2 branches
L2/Main BTB latency	5 cycles	2-3 cycles
Technology Node	7nm	7nm

可见 AMD Zen 2 在 BTB 容量上有优势，但是延迟要更长；两者都在最后一级 BTB 上做了压缩，但是压缩的方法和目的不同：

AMD Zen 2 的压缩方法是，把同一个 64B cacheline 内一条 cond 和一条 uncond 指令放在同一个 entry 当中。这样做的好处是，当预测到 cond 分支不跳转的时候，可以直接根据 uncond 指令的信息，得到下一个 fetch block 的地址；但是也对代码的结构有要求，必须是在同一个 cacheline 中，依次出现一个 cond 和一个 uncond
ARM Neoverse N1 的压缩方法是，根据立即数范围对分支进行分类，如果分支的立即数范围比较小，就只占用一个 entry 的一半也就是 41 bit；如果分支的立即数范围过大，就占用一个完整的 82 bit 的 entry；这主要是一个减少 SRAM 占用的优化，避免了所有的分支都要记录完整的 82 bit 信息；对代码的结构要求比较小，只要是跳转距离不太远的分支，都可以存到 41 bit 内

二者都没有实现一个周期预测两条分支，即 two taken（ARM 的说法是 two predicted branches per cycle）。这要等到 2020 年的 ARM Neoverse N2/V1，或者 2022 年的 AMD Zen 4 才被实现。

注意到 AMD 的 Software Optimization Guide for AMD EPYC™ 7002 Processors (Publication No. 56305) 文档里，有这么一段表述：

Branches whose target crosses a half-megabyte aligned boundary are unable to be installed in the L0 BTB or to share BTB entries with other branches.

也就是说，如果两个分支要共享一个 BTB entry，那么它们的目的地址不能跨越 512KB 边界，也就是和分支地址的偏移量不超过 19 位。按 48 位虚拟地址计算，如果 BTB entry 只记录一条分支，最多需要记录目的地址的完整 48 位地址；如果现在 BTB entry 要存两条分支，这两条分支的目的地址都只需要记录 19 位，加起来也就 38 位，还可以空余 10 位的信息用来维护 BTB sharing 所需的额外信息。

所以说到底，无论是 AMD 还是 ARM，做的事情都是对一个固定长度的 entry 设置了不同的格式，一个格式保存的地址位数多，但是只能保存一个分支；另一个格式保存的地址位数少，但是可以保存两个分支。区别就是 AMD 对两个分支的类型和位置有要求，而 ARM 允许这两个分支毫无关系。这就是不同厂商的取舍了。

AMD Zen 1 的 BTB 结构分析

Mon, 07 Jul 2025 00:00:00 +0000

AMD Zen 1 的 BTB 结构分析¶

背景¶

AMD Zen 1 是 AMD 在 2017 年发布的 Zen 系列第一代微架构。在之前，我们分析了 ARM Neoverse N1 和 V1 的 BTB，那么现在也把视线转到 AMD 上，看看 AMD 的 Zen 系列的 BTB 是如何演进的。

官方信息¶

AMD 在 Software Optimization Guide for AMD Family 17h Processors (Publication No. 55723) 中有如下的表述：

The branch target buffer (BTB) is a three-level structure accessed using the fetch address of the current fetch block.

Zen 1 的 BTB 有三级，是用当前 fetch block 的地址去查询。

Each BTB entry includes information for branches and their targets. Each BTB entry can hold up to two branches if the branches reside in the same 64-byte aligned cache line and the first branch is a conditional branch.

Zen 1 的 BTB entry 有一定的压缩能力，一个 entry 最多保存两条分支，前提是两条分支在同一个 64B 缓存行中，并且第一条分支是条件分支。这样，如果第二条分支是无条件分支，分支预测的时候，可以根据第一条分支的方向预测的结果，决定要用哪条分支的目的地址作为下一个 fetch block 的地址。虽然有压缩能力，但是没有提到单个周期预测两条分支，所以只是扩大了等效 BTB 容量。

例如，有这么一段代码：

# fetch block entrypoint entrypoint: # do something jnz targetA # do something jmp targetB

那么 jnz 和 jmp 指令可以放到同一个 entry 当中，一次读出来，然后对 jnz 指令进行分支方向预测：

如果 jnz 预测为跳转，那么当前 fetch block 从 entrypoint 开始，到 jnz 结束；下一个 fetch block 从 targetA 开始
如果 jnz 预测为不跳转，那么当前 fetch block 从 entrypoint 开始，到 jmp 结束；下一个 fetch block 从 targetB 开始

L0BTB holds 4 forward taken branches and 4 backward taken branches, and predicts with zero bubbles.

Zen 1 的第一级 BTB 可以保存 4 条前向分支和 4 条后向分支，预测不会带来流水线气泡，也就是说每个周期都可以预测一次。

L1BTB has 256 entries and creates one bubble if prediction differs from L0BTB.

Zen 1 的第二级 BTB 可以保存 256 个 entry，但不确定这个 entry 是否可以保存两条分支，也不确定这个 entry 数量代表了实际的 entry 数量还是分支数量，后续会做实验证实；预测会产生单个气泡，意味着它的延迟是两个周期。

L2BTB has 4096 entries and creates four bubbles if its prediction differs from L1BTB.

Zen 1 的第三级 BTB 可以保存 4096 个 entry，但不确定这个 entry 是否可以保存两条分支，也不确定这个 entry 数量代表了实际的 entry 数量还是分支数量，后续会做实验证实；预测会产生四个气泡，意味着它的延迟是五个周期。

简单整理一下官方信息，大概有三级 BTB：

(4+4)-entry L0 BTB, 1 cycle latency
256-entry L1 BTB, 2 cycle latency
4096-entry L2 BTB, 5 cycle latency

下面结合微架构测试，进一步研究它的内部结构。

微架构测试¶

在之前的博客里，我们已经测试了各种处理器的 BTB，在这里也是一样的：按照一定的 stride 分布无条件直接分支，构成一个链条，然后测量 CPI。

考虑到 Zen 1 的 BTB 可能出现一个 entry 保存两条分支的情况，并且还对分支的类型有要求，因此下面的测试都会进行四组，分别对应四种分支模式：

uncond：所有分支都是无条件分支：uncond, uncond, uncond, uncond, ...
cond：所有分支都是条件分支：cond, cond, cond, cond, ...
mix (uncond + cond)：条件分支和无条件分支轮流出现，但 uncond 在先：uncond, cond, uncond, cond, ...
mix (cond + uncond)：条件分支和无条件分支轮流出现，但 cond 在先：cond, uncond, cond, uncond, ...

stride=4B¶

首先是 stride=4B 的情况：

可以看到，图像上出现了三个比较显著的台阶：

所有分支模式下，第一个台阶都是到 4 条分支，CPI=1.25，比 1 周期略高，猜测是因为循环体比较小，循环结束的操作的开销没有平摊造成的；4 对应了 4-entry 的 L0 BTB
所有分支模式下，第二个台阶都是到 256 条分支，CPI=2，对应了 256-entry 的 L1 BTB，意味着 L1 BTB 没有做一个 BTB entry 记录两条分支的优化，实际上就是 256 个 entry 保存 256 条分支
在 uncond 和 cond 模式下，第三个台阶到 2048 条分支，CPI=5，对应 L2 BTB，没有显现出完整的 4096 的大小，意味着 L2 BTB 实际上只有 2048 个 entry，每个 entry 最多保存两条分支，而 uncond 和 cond 模式下，不满足每个 entry 保存两条分支的条件，所以只保存了 2048 条分支
在 mix (uncond + cond) 模式下，第三个台阶一直延伸到了 3072，超出了 2048，意味着出现了两条分支保存在一个 entry 的情况，但并没有体现出完整的 4096 条分支的大小
在 mix (cond + uncond) 模式下，第三个台阶延伸到了 4096，体现出完整的 4096 的 L2 BTB 大小

可以观察到，过了 L2 BTB 容量以后，性能骤降到十多个 cycle，此时还没有超出 L1 ICache 容量，这么长的延迟，即使是在 uncond 模式下，可以在译码的时候发现 uncond 分支并 redirect，也要 16+ 个周期，可见其流水线之长。

stride=8B¶

接下来观察 stride=8B 的情况：

现象和 stride=4B 基本相同，各级 BTB 显现出来的大小没有变化。

stride=16B¶

继续观察 stride=16B 的情况：

相比 stride=4B/8B，L0 BTB 的行为没有变化；L1 BTB 的容量减半到了 128，意味着 L1 BTB 采用了组相连，此时有一半的 set 不能被用上。此外，比较特别的是，从 stride=16B 开始，CPI=5 的平台出现了波动，CPI 从 5 变到 4 再变到了 5，猜测此时 L1 BTB 也有一定的比例会介入。L2 BTB 在 mix (uncond + cond) 模式下，拐点从 3072 前移到 2560。

stride=32B¶

继续观察 stride=32B 的情况：

相比 stride=16B，L0 BTB 的行为没有变化；L1 BTB 的容量进一步减到了 64，符合组相连的预期；L2 BTB 在 mix (uncond + cond) 模式下不再能体现出 3072 的容量，而是 2048：此时在一个 64B cacheline 中只有两条分支，第一条分支是 uncond，第二条分支是 cond，不满足 entry 共享的条件（必须 cond + uncond，不能是 uncond + cond），此时 uncond 和 cond 分别保存在两个 entry 中，每个 entry 只保存一条分支，因此 L2 BTB 只能体现出 2048 的容量。而 mix (cond + uncond) 模式依然满足 entry 共享的条件，所以依然体现出 4096 的容量。

stride=64B¶

继续观察 stride=64B 的情况：

相比 stride=32B，L0 BTB 的行为没有变化；L1 BTB 的容量进一步减到了 32，符合组相连的预期；L2 BTB 在 mix (cond + uncond) 模式下只能体现出 2048 的容量，此时每个 64B cacheline 都只有一条分支，不满足两条分支共享一个 entry 的条件。

stride=128B¶

继续观察 stride=128B 的情况：

相比 stride=64B，L0 BTB 的行为没有变化；L1 BTB 的容量进一步减到了 16，符合组相连的预期；L2 BTB 的容量减半到了 1024，意味着 L2 BTB 也是组相连结构。

小结¶

测试到这里就差不多了，更大的 stride 得到的也是类似的结果，总结一下前面的发现：

L0 BTB 是 (4+4)-entry，1 cycle latency，不随着 stride 变化，全相连
L1 BTB 是 256-entry，2 cycle latency，容量随着 stride 变化，大概率是 PC[n:3] 这一段被用于 index，使得 stride=16B 开始容量不断减半
L2 BTB 是 2048-entry，5 cycle latency，容量随着 stride 变化，大概率是 PC[n:6] 这一段被用于 index，使得 stride=128B 开始容量不断减半；每个 entry 最多保存两条分支，前提是这两条分支在同一个 cacheline 当中，并且第一条是 cond，第二条是 uncond；因此最多保存 4096 个分支

也总结一下前面发现了各种没有解释的遗留问题：

stride=4B/8B/16B 且为 mix (uncond + cond) 模式时，L2 BTB 体现出 3072/3072/2560 的容量，而非 4096：解析见后
L2 BTB 对应的 CPI=5 的台阶出现比较明显的，在 4-5 之间的波动：暂无解释

接下来尝试解析一下这些遗留问题背后的原理。部分遗留问题，并没有被解释出来，欢迎读者提出猜想。

解析遗留问题¶

stride=4B/8B/16B 且为 mix (uncond + cond) 模式时，L2 BTB 体现出 3072/3072/2560 的容量，而非 4096¶

前面测试出来，观察到两个奇怪的容量：3072 和 2560，分别有 3 和 5 的因子。下面通过进一步的实验，观察它的来源。

stride=16B 对应 2560 的 L2 BTB 容量¶

首先针对这个 2560 的拐点，做了一系列测试，在 stride=16B 的情况下，测试不同的 uncond/cond 分支的组合，下面是 64B cacheline 内四条分支的类型的不同组合（U 代表 Uncond，C 代表 Cond），以及该组合对应的容量：

CCCC: 2048（即 cond 模式）
CCCU: 2560
CCUC: 2560
CCUU: 2560
CUCC: 2560
CUCU: 4096（即 mix (cond + uncond) 模式）
CUUC: 2560
CUUU: 2560
UCCC: 2048
UCCU: 2560
UCUC: 2560（即 mix (uncond + cond) 模式）
UCUU: 2560
UUCC: 2048
UUCU: 2560
UUUC: 2048
UUUU: 2048（即 uncond 模式）

可以观察到，如果没有出现连续的 CU（CCCC/UCCC/UUCC/UUUC/UUUU），容量是 2048；如果出现了一组 CU（CCCU/CCUC/CCUU/CUCC/CUUC/CUUU/UCCU/UCUC/UCUU/UUCU），容量是 2560；出现了两组 CU（CUCU），就是 mix (cond + uncond) 模式，容量是 4096。

一种可能的猜想：

如果没有出现连续的 CU，那么每个 branch 占用一个 entry，那么容量就是 2048 个 branch
如果出现了一组 CU，那么一个 64B cacheline 里的 4 个 branch 对应 3 个 entry，那么前 2048 个 branch 对应 1536 个 entry，还剩下 512 个 entry，这些 entry 每个 entry 只放 1 个 branch（讨论见后），所以最后容量是 2048+512=2560 个 branch
如果出现了两组 CU，那么每一组 CU 的两个 branch 对应一个 entry，容量是 4096 个 branch

但是也遗留了一个问题，就是只有一组 CU 的情况下，为啥剩下的 512 个 entry 只放 512 个 branch，而不能放 1024 个 branch，按理说是可能再次出现 cond + uncond 合并？这个问题暂时还没有解释。

由此可以看出，2560 的来源是 4 路组相连，然后其中一路发生了 cond + uncond 的合并，所以最终是 5 个分支保存到 4 路当中，再来一条分支就会放不下。

stride=4B/8B 对应 3072 的 L2 BTB 容量¶

带着上面的分析，再去观察 stride=4B/8B 时的 3072:3072 有 3 的因子，所以大概率是从 2 路组相连得来，其中一路出现了 cond + uncond 的合并，所以出现了 3 个 branch 占用 2 个 entry 的情况，最后体现出来就是 3072 的 L2 BTB 容量。

似乎到这里，3072 和 2560 分别的 3 和 5 的因子都能解释了，剩下的就是解析具体的组相连的结构。

组相连分析¶

那么到底是 2 路组相连，还是 4 路组相连呢，另外这个组相连的 set 是怎么构成的呢？

首先回忆一下，在 ARM Neoverse N1 中，连续的 32B 内能放 6 个分支，但是 stride=8B 的时候，一次就会往同一个 set 里增加 4 个分支，于是一个 set 内的分支数从 0 变到 4 再变到 8，拐点出现在 4 个分支，而不是 6 个分支。因此为了达到前面出现的 3072 和 2560 的拐点，新增的分支也得均匀地分到各个 set 当中。

前面根据 L2 BTB 的容量分析到，L2 BTB 的 Index 可能是 PC[n:6]，但肯定不是简单的这么取，否则也会出现 ARM Neoverse N1 类似的问题。只能说明 PC[6] 往上有若干个 bit 是单独出现在 L2 BTB 的 Index 当中的，而 PC[5] 以下的 bit，可能以某种哈希函数的形式，参与到 Index 当中。

所以，L2 BTB 可能是以 PC[n:6] 作为 Index 去访问，然后内部有多个 bank，每个 bank 内部是 2 路组相连。bank index 是通过 PC 经过哈希计算得来，使得在 stride=4B/8B 的时候，体现出 2 路组相连，而在 stride=16B 的时候，体现出 4 路组相连。同时，分支还能够均匀地分布到各个 bank 当中，避免了和 ARM Neoverse N1 类似的情况的发生。