TorchInductor CPU Performance Dashboard #93531
Performance Dashboard for float32 precision -- Single-Socket Multi-threads

Executive Summary

We evaluate TorchInductor across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an ICX 8375C. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)
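The headline statistics above can be computed as follows; this is a minimal sketch with made-up per-model numbers (not taken from the dashboard tables). Speedup is eager latency divided by inductor latency, and the compression ratio is eager peak memory divided by inductor peak memory, so higher is better for both:

```python
from math import prod

# Hypothetical per-model speedups: eager latency / inductor latency.
speedups = [1.20, 0.95, 1.50, 1.10]
geomean_speedup = prod(speedups) ** (1 / len(speedups))

# Hypothetical peak memory (MB) for eager vs. inductor runs.
eager_peak = [900.0, 1200.0]
inductor_peak = [800.0, 1000.0]
ratios = [e / c for e, c in zip(eager_peak, inductor_peak)]
geomean_compression = prod(ratios) ** (1 / len(ratios))

# Pass rate: fraction of models that pass the accuracy check.
passed, total = 3, 4
passrate = passed / total

print(f"{geomean_speedup:.2f}x, {geomean_compression:.2f}x, {passrate:.0%}")
# → 1.17x, 1.16x, 75%
```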
torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
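The accuracy results above come from an elementwise tolerance comparison against native PyTorch. A minimal pure-Python sketch of the `torch.allclose`-style criterion this relies on (the sample values are illustrative, not dashboard data):

```python
def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise |a - b| <= atol + rtol * |b|, mirroring torch.allclose."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Small numerical drift between eager and compiled outputs passes;
# a real divergence fails.
eager_out = [0.1, 2.5, -3.0]
inductor_out = [0.100001, 2.5, -3.0000001]
print(allclose(eager_out, inductor_out))  # True
```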
Performance Dashboard for float32 precision -- Single-core Single-thread

Executive Summary

We evaluate TorchInductor across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
This is inference? Or training?
Inference.
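For reference, the inference speedup metric is wall-clock eager latency over compiled latency. A minimal timing-harness sketch; `eager_fn` and `compiled_fn` are illustrative stand-ins for a model run eagerly versus the same model under `torch.compile` (torch itself is deliberately not imported here):

```python
import time
from statistics import median

def bench(fn, *args, warmup=3, iters=20):
    """Median wall-clock latency of fn(*args), in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return median(samples)

# Stand-ins: in the dashboard these would be model(x) in eager mode
# versus torch.compile(model)(x).
def eager_fn(n):
    return sum(i * i for i in range(n))

def compiled_fn(n):  # algebraically "optimized" variant of the same computation
    return (n - 1) * n * (2 * n - 1) // 6

speedup = bench(eager_fn, 100_000) / bench(compiled_fn, 100_000)
print(f"speedup: {speedup:.1f}x")
```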
Performance Dashboard for float32 precision -- Single-Socket Multi-threads

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-core Single-thread

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-Socket Multi-threads

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-Socket Multi-threads

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-core Single-thread

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-Socket Multi-threads

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-core Single-thread

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2022-11-09 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-core Single-thread (2022-11-09 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2022-11-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
Performance Dashboard for float32 precision -- Single-core Single-thread (2022-11-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

Update: We use single-instance mode in this round.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2022-11-16 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
Performance Dashboard for float32 precision -- Single-core Single-thread (2022-11-16 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward pass. For accuracy, we check the numerical correctness of the forward-pass outputs by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_dynamic_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-09-29 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

Test command:
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_static_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-09-29 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

Test command:
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-Socket Multi-threads (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

Test command:
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-core Single-thread (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

Test command:
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-Socket Multi-threads (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

Test command:
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-core Single-thread (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on an Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint compression ratio.

Caveats

SW information
HW information

Test command:
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_static_shape] Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
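Two other columns reported above can be sketched the same way: "Passrate" is the fraction of models that pass the accuracy check, and "Peak Memory Compression Ratio" compares eager peak memory against inductor peak memory, so higher means the compiled run used less memory. The model names and numbers below are illustrative assumptions, not real dashboard data:

```python
def passrate(results):
    """Fraction of models passing the accuracy check, as in 'passed/total' rows."""
    passed = sum(1 for ok in results.values() if ok)
    return passed, len(results)

def memory_compression(eager_peak_mb: float, compiled_peak_mb: float) -> float:
    """Peak memory footprint compression ratio (higher is better)."""
    return eager_peak_mb / compiled_peak_mb

# Illustrative accuracy outcomes and peak-memory numbers, not real dashboard data.
accuracy = {"resnet50": True, "bert_base": True, "llama": False}
passed, total = passrate(accuracy)
print(f"passrate: {passed}/{total}")
print(f"compression: {memory_compression(1024.0, 800.0):.2f}x")
```

As the caveats note, models failing the accuracy check are removed before the performance, latency, and memory columns are aggregated, so the passrate denominator and the aggregation population differ.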
[cppwrapper_static_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-Socket Multi-threads (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-core Single-thread (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-Socket Multi-threads (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-core Single-thread (2024-10-13 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_dynamic_shape] Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2024-10-14 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_dynamic_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-10-14 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-Socket Multi-threads (2024-10-19 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-core Single-thread (2024-10-19 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-Socket Multi-threads (2024-10-19 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-core Single-thread (2024-10-19 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_static_shape] Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2024-10-20 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_static_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-10-20 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_dynamic_shape] Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2024-10-20 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_dynamic_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-10-20 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of the forward and backward pass for training, and the forward pass only for inference. For accuracy, we check the numerical correctness of forward-pass outputs and gradients by comparing against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

SW information:
HW information:

Test command:

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```

To measure performance, compilation latency, and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)

torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)

timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-Socket Multi-threads (2024-10-26 nightly release)Executive Summarysee moreWe evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of forward pass and backward pass for training and forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.Caveats
SW information:
HW information
Test command

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=amp --inference --compilers=inductor --extra-args="--timeout 9000"
```
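The accuracy gate compares compiled forward-pass outputs against native pytorch under a numeric tolerance. In spirit it is an allclose-style check like this plain-Python sketch; the `rtol`/`atol` values and the sample outputs are illustrative assumptions, not the harness defaults (the real harness uses torch's tensor comparisons):

```python
def allclose(a, b, rtol=1e-4, atol=1e-5):
    """torch.allclose-style elementwise tolerance check on flat lists:
    passes when |x - y| <= atol + rtol * |y| for every element pair."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

eager_out    = [0.10000, -2.50000, 3.14159]
inductor_out = [0.10001, -2.50010, 3.14150]  # tiny numerical drift is tolerated

ok = allclose(eager_out, inductor_out)
print("accuracy check:", "PASS" if ok else "FAIL")
```

Models that fail this gate are excluded from the speedup, latency, and memory summaries below.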
To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)
torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[amp] Performance Dashboard for amp precision -- Single-core Single-thread (2024-10-26 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8488C. Each experiment runs one iteration of forward pass and backward pass for training and forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
SW information:
HW information
Test command

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=amp --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```
To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)
torchbench suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
huggingface suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
timm_models suite with amp precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_static_shape] Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2024-10-27 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of forward pass and backward pass for training and forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
SW information:
HW information
Test command

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```
To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)
torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_static_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-10-27 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of forward pass and backward pass for training and forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
SW information:
HW information
Test command

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```
To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)
torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_dynamic_shape] Performance Dashboard for float32 precision -- Single-Socket Multi-threads (2024-10-27 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of forward pass and backward pass for training and forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
SW information:
HW information
Test command

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--node_id 0" --devices=cpu --dtypes=float32 --inference --compilers=inductor --extra-args="--timeout 9000"
```
To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)
torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
[cppwrapper_dynamic_shape] Performance Dashboard for float32 precision -- Single-core Single-thread (2024-10-27 nightly release)

Executive Summary

We evaluate different backends across three benchmark suites - torchbench, huggingface and timm. We run these experiments on Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. Each experiment runs one iteration of forward pass and backward pass for training and forward pass only for inference. For accuracy, we check the numerical correctness of forward pass outputs and gradients by comparing with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency numbers and peak memory footprint reduction ratio.

Caveats
SW information:
HW information
Test command

```bash
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export TORCHINDUCTOR_FREEZING=1
export OMP_NUM_THREADS=1
python benchmarks/dynamo/runner.py --enable_cpu_launcher --cpu_launcher_args "--core_list 0 --ncores_per_instance 1" --devices=cpu --dtypes=float32 --inference --compilers=inductor --batch_size=1 --threads 1 --extra-args="--timeout 9000"
```
To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate
Geometric mean speedup
Mean compilation time (seconds)
Peak memory footprint compression ratio (higher is better)
torchbench suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
huggingface suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
timm_models suite with float32 precision

Performance speedup
Accuracy
Compilation latency (sec)
Peak Memory Compression Ratio
Absolute latency (ms)
Dashboard to track the performance of torchinductor on CPU.
cc @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @soumith @ngimel @chauhang