v0.1.4
What's Changed
- Update benchmark.yml by @oulgen in #570
- Update benchmark.yml by @oulgen in #571
- [Benchmark] Use custom kernel metric mappings list to accommodate inconsistent naming by @oulgen in #567
- Add rms norm and cross entropy by @oulgen in #568
- Update benchmark_dispatch.yml by @oulgen in #573
- Update linters by @oulgen in #569
- Print config for PassManager::run triton errors by @jansel in #565
- Error when invalid loop reduction number config is generated by @oulgen in #572
- Add `skipIfLowVRAM` or `use_default_config=False` to specific unit tests to enable local testing by @yf225 in #574 (sketched after this list)
- Fix bug with block_size smaller than minimum by @jansel in #575
- Better shape errors for mismatched tile sizes by @jansel in #566
- Print warning if block_size is specified in interpret mode. by @choijon5 in #576
- Run all shapes for benchmarks by @oulgen in #578
- [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
- [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
- [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
- Do not benchmark twice by @oulgen in #583
- Add missing functions to docs by @jansel in #586
- hl.atomic_add: support 1D tensor as index by @yf225 in #587 (see the atomics sketch after this list)
- Add atomic and/or/min/max/cas/xchg by @jansel in #589
- Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
- Add link to github to docs by @jansel in #591
- Support layernorm without bias by @mengluy0125 in #585
- Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
- Add layer_norm backward kernels by @yf225 in #588
- Fix tf32 warning by @jansel in #592
- [Benchmark] geglu example and test by @Sibylau in #582
- Print default config when running with it by @oulgen in #599
- [Benchmark] swiglu example and test by @Sibylau in #584
- Login to Docker from the workflows by @huydhn in #601
- Add rms_norm backward kernels by @mengluy0125 in #597
- Revert "Login to Docker from the workflows" by @oulgen in #604
- Fix static shape typo by @oulgen in #609
- Add small dim size (<16) support to hl.dot and torch.addmm; always prefer using `tl.dot(acc=...)` for addmm / baddbmm by @yf225 in #564
- Fix rms_norm and layer_norm by @mengluy0125 in #603
- [Benchmark] jsd kernel and test by @Sibylau in #611
- Refactor autotune error handling by @jansel in #595
- Possible fix for CI failures by @jansel in #617
- [Benchmark] Welford kernel and example by @karthickai in #614
- [Benchmark] kl_div kernel and test by @Sibylau in #615
- Ignore TServiceRouterException errors while autotuning by @jansel in #618
- [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
- Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
- Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
- Add more kernels to benchmarking by @oulgen in #632
- Reorder benchmarks by @oulgen in #633
- [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
- Support using block size var outside of hl.tile loop by @yf225 in #619
- [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
- Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
- Always clear inductor cache before benchmark by @yf225 in #608
- Make hl.specialize work on sequences by @jansel in #636 (sketched after this list)
- Better error for passing Tile to hl.tile by @jansel in #640
- [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
- int4_gemm: remove use_default_config=True by @yf225 in #639
- [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
- Avoid skipping CUDA errors that crash the CUDA context by @yf225 in #645
- Add `HELION_AUTOTUNE_RANDOM_SEED` env var and `autotune_random_seed` setting by @yf225 in #644 (sketched after this list)
- Bump linter by @oulgen in #647
- Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
- Fix lint related to welford and also local_cache by @yf225 in #646
- Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
- PT Sphinx Theme Test by @sekyondaMeta in #600
- Print `static_shapes` settings value along with config for accurate repro by @yf225 in #649
- [Benchmark] gather_gemv kernel and test by @Sibylau in #635
- Add HELION_SKIP_CACHE env var by @jansel in #653 (see the env-var sketch after this list)
- [lint] Remove UP038 reference by @jansel in #650
- Fix `register_block_size` codegen by @yf225 in #659
- Raise better error when `hl.atomic_*` is used on device tensor by @yf225 in #658
- [Autotune] Filter bad config with accuracy check by @yf225 in #655
- Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
- Log autotune random seed for easier repro by @yf225 in #661
- Fix misaligned address error for matmul by @yf225 in #662
- skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
- rms_norm: get weight from function args by @yf225 in #664
- skip full autotune if configs are provided by @xuanzhang816 in #670
- [example] fused_linear_jsd by @v0i0 in #494
- Fix CI by moving B200 to cuda13 and downgrade a100/h100 to cuda12.8 by @oulgen in #674
- No background image by @sekyondaMeta in #663
- Remove github link from index.md by @oulgen in #675
- [Autotune] Allow skipping Triton compilation error by @yf225 in #679
- [Benchmark CI] Run one kernel per GPU to maximize successful kernel reporting by @yf225 in #681
- Fix missing block size constexpr assignment in host code by @yf225 in #678
- [CI] Fix missing setuptools by @yf225 in #680
- faster rms norm backwards kernel by @v0i0 in #624
- [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
- [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
- Increase tolerance for _validate_against_baseline by @jansel in #691
- [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
- Print bad default config if compute baseline fails by @yf225 in #688
- Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692 (see the env-var sketch after this list)
- rms norm: improve fwd perf by @v0i0 in #669
- Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
- [Autotune] Skip Triton shared memory OOM by @yf225 in #684
- [Benchmark CI] Customizable num_inputs for each kernel by @yf225 in #699
- Run autotune with TRITON_LOCAL_BUILD=1 by @jansel in #695
- Add test for no options found in autotuner by @jansel in #693
- [Benchmark CI] fix layer_norm mapping bug by @yf225 in #701
- [Benchmark CI] Change DEFAULT_NUM_INPUTS to MAX_NUM_INPUTS by @yf225 in #702
- [Benchmark CI] Allow customized mapping into tritonbench impls by @yf225 in #700
- [Benchmark CI] Add `--list-kernels-for-benchmark-ci` to benchmark runner by @yf225 in #703
- Increase test timeout to 60 by @jansel in #697
- [Benchmark CI] Run rms_norm-bwd and layer_norm-bwd kernels by @yf225 in #705
- Revert "[Benchmark CI] Run rms_norm-bwd and layer_norm-bwd kernels" by @yf225 in #706
- Revert "[Benchmark CI] Add
--list-kernels-for-benchmark-ci
to benchmark runner" by @yf225 in #707 - [Benchmark CI] Run
rms_norm-bwd
andlayer_norm-bwd
kernels by @yf225 in #708 - [Benchmark CI] Run fewer inputs for layer_norm-bwd to avoid job timeout by @yf225 in #709
- [Benchmark CI] Set tolerance values that match autotuner setting by @yf225 in #710
- [Benchmark CI] Skip last input shape for rms_norm-bwd by @yf225 in #712
- Rebenchmark configs to avoid noise by @jansel in #654
- Increase default num_generations to 40 by @jansel in #677
- Add PatternSearch autotuning algorithm by @jansel in #696
- [Benchmark CI] Simplify h100 display name by @oulgen in #713
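Several changes above add or rely on user-facing knobs; the sketches below illustrate them. They are minimal examples inferred from the PR titles, not verified against the released API. For #574-style local testing, a kernel can skip autotuning entirely via use_default_config (assumed decorator usage):

```python
import torch
import helion
import helion.language as hl

# use_default_config=True runs Helion's default config instead of autotuning,
# which keeps local test runs fast; #574's tests pass use_default_config=False
# where autotuned behavior is the thing under test.
@helion.kernel(use_default_config=True)
def add_one(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size(0)):
        out[tile] = x[tile] + 1
    return out
```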
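For #587 and #589, a hedged histogram sketch: the 1D-tensor index form of hl.atomic_add comes from the PR title, and the bracketed-index convention is an assumption:

```python
import torch
import helion
import helion.language as hl

@helion.kernel(use_default_config=True)
def histogram(indices: torch.Tensor, num_bins: int) -> torch.Tensor:
    counts = torch.zeros(num_bins, dtype=torch.int32, device=indices.device)
    for tile in hl.tile(indices.size(0)):
        # #587: the index may itself be a 1D tensor, scattering one increment
        # per element of indices[tile] into counts.
        hl.atomic_add(counts, [indices[tile]], 1)
    return counts
```

#589 extends the same family with atomic and/or/min/max/cas/xchg, which should follow the same call shape.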
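For #636, a sketch of hl.specialize over a sequence; that it returns the specialized values as a tuple is an assumption extrapolated from the scalar form:

```python
import torch
import helion
import helion.language as hl

@helion.kernel(use_default_config=True)
def double(x: torch.Tensor) -> torch.Tensor:
    # Previously sizes were specialized one at a time (hl.specialize(x.size(0)));
    # #636 allows passing the whole size sequence so each dimension is baked in
    # as a compile-time constant.
    m, n = hl.specialize(x.size())
    out = torch.empty_like(x)
    for tile_m, tile_n in hl.tile([m, n]):
        out[tile_m, tile_n] = x[tile_m, tile_n] * 2
    return out
```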
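For #644, the autotuner seed can be pinned process-wide or per kernel; both names come from the PR title, the kwarg form is assumed:

```python
import os

# Process-wide: set before kernels are compiled so autotuning is reproducible.
os.environ["HELION_AUTOTUNE_RANDOM_SEED"] = "42"

import torch
import helion
import helion.language as hl

# Per-kernel: the equivalent setting on the decorator.
@helion.kernel(autotune_random_seed=42)
def scale(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size(0)):
        out[tile] = x[tile] * 2
    return out
```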
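For #653 and #692, two debugging escape hatches; the variable names are taken straight from the PR titles, their exact semantics are assumptions:

```python
import os

# #653: skip Helion's kernel cache so every run recompiles from scratch.
os.environ["HELION_SKIP_CACHE"] = "1"

# #692: turn off the autotuner accuracy check from #655, e.g. when it
# rejects configs for intentionally nondeterministic kernels.
os.environ["HELION_AUTOTUNE_ACCURACY_CHECK"] = "0"
```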
New Contributors
- @choijon5 made their first contribution in #576
- @mengluy0125 made their first contribution in #585
- @huydhn made their first contribution in #601
- @v0i0 made their first contribution in #494
Full Changelog: v0.1.3...v0.1.4